FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
- URL: http://arxiv.org/abs/2406.16069v3
- Date: Fri, 04 Oct 2024 19:14:32 GMT
- Title: FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
- Authors: Junyi Zhu, Shuochen Liu, Yu Yu, Bo Tang, Yibo Yan, Zhiyu Li, Feiyu Xiong, Tong Xu, Matthew B. Blaschko
- Abstract summary: FastMem is a novel method designed to enhance instruction fine-tuned large language models' context awareness.
It maximizes the likelihood of the prompt before inference by updating only the last Feed-Forward Network (FFN) module.
Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures.
- Score: 24.030755262499994
- Abstract: Large language models (LLMs) excel in generating coherent text, but they often struggle with context awareness, leading to inaccuracies in tasks requiring faithful adherence to provided information. We introduce FastMem, a novel method designed to enhance instruction fine-tuned LLMs' context awareness through fast memorization of the prompt. FastMem maximizes the likelihood of the prompt before inference by updating only the last Feed-Forward Network (FFN) module. This targeted approach ensures efficient optimization without overfitting, significantly improving the model's ability to comprehend and accurately follow the context. Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures. For instance, FastMem improves the accuracy of Llama 3-8B-Inst on the NQ-SWAP dataset from 59.1% to 71.6%, and reduces the output structure failure rate of Qwen 1.5-4B-Chat from 34.9% to 25.5%. Extensive experimental results highlight FastMem's potential to offer a robust solution to enhance the reliability and accuracy of LLMs in various applications. Our code is available at: https://github.com/IAAR-Shanghai/FastMem
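The described mechanism lends itself to a compact sketch: a handful of gradient steps that minimize the prompt's next-token loss while touching only the final FFN block. Below is a minimal illustration using Hugging Face transformers, assuming a LLaMA-style module layout (`model.model.layers[-1].mlp`); the step count and learning rate are illustrative guesses, not the paper's configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # any instruct model works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def fastmem(prompt: str, steps: int = 5, lr: float = 1e-4) -> None:
    """Briefly 'memorize' the prompt: minimize its next-token NLL while
    updating only the last decoder layer's FFN (MLP) module."""
    for p in model.parameters():
        p.requires_grad_(False)
    ffn = model.model.layers[-1].mlp  # assumed LLaMA-style module path
    for p in ffn.parameters():
        p.requires_grad_(True)

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    optim = torch.optim.AdamW(ffn.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        # labels=input_ids gives the standard causal-LM loss on the prompt.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        loss.backward()
        optim.step()
        optim.zero_grad()
    model.eval()
```

Because only one FFN block changes, its original weights can be cached before memorization and restored after generation, keeping the per-prompt overhead small.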
Related papers
- QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tunes a small pretrained language model to generate optimal prompts tailored to the input queries.
We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.
Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
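QPO's multi-loop offline RL pipeline is more involved than this summary conveys; the simplest offline surrogate for the idea is reward-weighted fine-tuning of a small prompt-writer LM on logged (query, prompt, score) triples. The sketch below uses that simplification, with GPT-2 as a stand-in model and invented data:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Invented demonstration log: (query, candidate prompt, benchmark score).
logged = [
    ("What is 12*7?", "Let's reason step by step before answering.", 0.9),
    ("What is 12*7?", "Answer in one word.", 0.4),
]

tok = AutoTokenizer.from_pretrained("gpt2")      # stand-in small LM
lm = AutoModelForCausalLM.from_pretrained("gpt2")
optim = torch.optim.AdamW(lm.parameters(), lr=1e-5)

for query, prompt, reward in logged:
    batch = tok(f"Query: {query}\nPrompt: {prompt}", return_tensors="pt")
    # Weight the likelihood of good prompts more than bad ones.
    loss = reward * lm(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```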
arXiv Detail & Related papers (2024-08-20T03:06:48Z)
- SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context [49.9628075245959]
We present the Sentence Variational Autoencoder (SentenceVAE), which includes a Sentence Encoder to compress multiple tokens in a sentence into a single token, and a Sentence Decoder to reconstruct it.
The proposed method can accelerate inference speed by 204-365%, reduce perplexity (PPL) to 46-75% of its original metric, and decrease memory overhead by 86-91% for the equivalent context length.
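Architecturally, the idea reduces to "pool a sentence into one vector, then decode the tokens back out". A toy sketch follows; the layer sizes and the mean-pooling choice are assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class SentenceVAE(nn.Module):
    """Toy version: encode a sentence's tokens, pool them into a single
    'sentence token', and reconstruct the tokens from that one vector."""
    def __init__(self, vocab: int = 32000, d: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(tokens))      # (B, T, d)
        sent = h.mean(dim=1, keepdim=True)        # (B, 1, d): one token
        dec_in = sent.repeat(1, tokens.size(1), 1)
        out, _ = self.decoder(dec_in)             # reconstruct T steps
        return self.head(out)                     # (B, T, vocab) logits
```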
arXiv Detail & Related papers (2024-08-01T15:45:19Z)
- On the Worst Prompt Performance of Large Language Models [93.13542053835542]
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
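The evaluation protocol is easy to state precisely: score every paraphrase of a query and keep the worst. A minimal sketch, where `model_fn` and `grade` are assumed callables rather than anything from the paper:

```python
def worst_prompt_score(model_fn, cases, grade):
    """Each case is (paraphrases, reference): a set of semantically
    equivalent prompts plus a reference answer. Return the average
    worst-case score across cases."""
    worst = []
    for paraphrases, reference in cases:
        scores = [grade(model_fn(p), reference) for p in paraphrases]
        worst.append(min(scores))
    return sum(worst) / len(worst)
```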
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
- SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM [24.65339628772433]
SUBLLM is an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules.
During training, SUBLLM increases speed by 26% and cuts memory by 10GB per GPU. In inference, it boosts speed by up to 37% and reduces memory by 1GB per GPU.
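As a rough picture of the subsample/upsample/bypass idea: drop low-scoring tokens before an expensive span of layers, scatter the processed tokens back to full length, and add a residual bypass around the span. The sketch below is an assumption-laden toy, not SUBLLM's actual modules:

```python
import torch
import torch.nn as nn

class SubsampledSpan(nn.Module):
    """Toy subsample -> process -> upsample -> bypass block."""
    def __init__(self, d: int = 512, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(d, 1)   # learns which tokens to keep
        self.inner = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, d)
        B, T, d = x.shape
        k = max(1, int(T * self.keep_ratio))
        # Subsample: keep the top-k tokens, in their original order.
        idx = self.score(x).squeeze(-1).topk(k, dim=1).indices.sort(dim=1).values
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, d)
        kept = self.inner(x.gather(1, gather_idx))  # cheaper: shorter sequence
        # Upsample: scatter processed tokens back to full length.
        up = torch.zeros_like(x).scatter(1, gather_idx, kept)
        return x + up                               # bypass connection
```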
arXiv Detail & Related papers (2024-06-03T16:43:04Z)
- Efficient Prompt Tuning by Multi-Space Projection and Prompt Fusion [9.55994486328914]
Prompt tuning is a promising method to fine-tune a pre-trained language model without retraining its large-scale parameters.
Existing methods struggle to balance accuracy and efficiency: a longer (shorter) soft prompt generally leads to better (worse) accuracy, but at the cost of more (less) training time.
We propose an Efficient Prompt Tuning method (EPT) by multi-space projection and prompt fusion.
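One plausible reading of "multi-space projection and prompt fusion" is projecting a single short soft prompt into several learned subspaces and fusing the projections with learned weights. The toy below is entirely an assumption for illustration, not EPT's actual architecture:

```python
import torch
import torch.nn as nn

class MultiSpacePrompt(nn.Module):
    """Hypothetical reading: one soft prompt, several projections, a
    learned softmax fusion over the projected views."""
    def __init__(self, n_tokens: int = 8, d: int = 768, n_spaces: int = 4):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, d) * 0.02)
        self.projs = nn.ModuleList(nn.Linear(d, d) for _ in range(n_spaces))
        self.fuse = nn.Parameter(torch.zeros(n_spaces))

    def forward(self) -> torch.Tensor:
        views = torch.stack([p(self.prompt) for p in self.projs])  # (S, n, d)
        w = torch.softmax(self.fuse, dim=0).view(-1, 1, 1)
        return (w * views).sum(dim=0)   # fused soft prompt: (n, d)
```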
arXiv Detail & Related papers (2024-05-19T06:43:12Z)
- StablePT: Towards Stable Prompting for Few-shot Learning via Input Separation [14.341806875791288]
StablePT outperforms state-of-the-art methods by 6.97% in accuracy and reduces the standard deviation by 1.92 on average.
Tests underscore its robustness and stability across 8 datasets covering various tasks.
arXiv Detail & Related papers (2024-04-30T08:01:49Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation. To mitigate the overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
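The gist is a per-token gate on the FFN: tokens whose representation has stopped changing skip the block. The sketch below shows only the decision rule (a cosine-similarity threshold, which is an assumption); a real implementation would avoid computing the FFN for skipped tokens rather than computing and discarding the result:

```python
import torch
import torch.nn.functional as F

def ffn_or_skip(x: torch.Tensor, ffn, tau: float = 0.98) -> torch.Tensor:
    """Apply the FFN only where it still matters. x: (B, T, d)."""
    y = x + ffn(x)                               # candidate update
    sim = F.cosine_similarity(x, y, dim=-1)      # (B, T) per-token
    saturated = (sim >= tau).unsqueeze(-1)       # barely changed -> skip
    return torch.where(saturated, x, y)
```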
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with, and improve the performance of, popular alternatives such as retrieval-augmented generation.
arXiv Detail & Related papers (2024-03-07T08:34:57Z)
- CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And Faithfulness [13.554160815699435]
CliqueParcel is designed to improve the efficiency of large language models (LLMs) during the inference process.
CliqueParcel is tested on eight widely recognized datasets.
This work provides novel insights into inference efficiency and demonstrates promising performance.
arXiv Detail & Related papers (2024-02-17T22:37:17Z)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
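Mechanically, "prompting a compressed model" amounts to soft-prompt tuning with the (compressed) backbone frozen. A minimal single-step sketch with GPT-2 standing in for a compressed model; the prompt length and learning rate are assumptions:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in for a compressed LLM
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():                  # backbone stays frozen
    p.requires_grad_(False)

n_virtual, d = 16, model.config.hidden_size
soft = nn.Parameter(torch.randn(1, n_virtual, d) * 0.02)  # learned prompt
optim = torch.optim.AdamW([soft], lr=1e-3)

batch = tok("The capital of France is Paris.", return_tensors="pt")
emb = model.get_input_embeddings()(batch["input_ids"])
inputs = torch.cat([soft, emb], dim=1)        # prepend virtual tokens
labels = torch.cat(                           # no loss on virtual positions
    [torch.full((1, n_virtual), -100), batch["input_ids"]], dim=1)
loss = model(inputs_embeds=inputs, labels=labels).loss
loss.backward()
optim.step()
optim.zero_grad()
```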
arXiv Detail & Related papers (2023-05-17T20:45:13Z)