FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
- URL: http://arxiv.org/abs/2406.16069v1
- Date: Sun, 23 Jun 2024 10:36:35 GMT
- Title: FastMem: Fast Memorization of Prompt Improves Context Awareness of Large Language Models
- Authors: Junyi Zhu, Shuochen Liu, Yu Yu, Bo Tang, Yibo Yan, Zhiyu Li, Feiyu Xiong, Tong Xu, Matthew B. Blaschko
- Abstract summary: We introduce FastMem, a novel method to enhance instruction fine-tuned large language models' context awareness.
FastMem maximizes the likelihood of the prompt before inference by fine-tuning only the last Feed-Forward Network (FFN) module.
Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures.
- Score: 24.030755262499994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel in generating coherent text, but they often struggle with context awareness, leading to inaccuracies in tasks requiring faithful adherence to provided information. We introduce FastMem, a novel method designed to enhance instruction fine-tuned LLMs' context awareness through fast memorization of the prompt. FastMem maximizes the likelihood of the prompt before inference by fine-tuning only the last Feed-Forward Network (FFN) module. This targeted approach ensures efficient optimization without overfitting, significantly improving the model's ability to comprehend and accurately follow the context. Our experiments demonstrate substantial gains in reading comprehension, text summarization and adherence to output structures. For instance, FastMem improves the accuracy of Llama 3-8B-Inst on the NQ-SWAP dataset from 59.1% to 71.6%, and reduces the output structure failure rate of Qwen 1.5-4B-Chat from 34.9% to 25.5%. Extensive experimental results highlight FastMem's potential to offer a robust solution to enhance the reliability and accuracy of LLMs in various applications. Our code is available at: https://github.com/IAAR-Shanghai/FastMem
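The mechanism described in the abstract can be illustrated in a few lines: before decoding, freeze everything except a final layer and take a few gradient steps that raise the likelihood of the prompt itself. The toy model below is only an illustrative sketch under stated assumptions, not the paper's implementation: a frozen per-token embedding stands in for the backbone, and a single trainable logit projection `W` stands in for the last FFN module.

```python
import math
import random

random.seed(0)
VOCAB, HIDDEN = 8, 4

# Frozen "backbone": a fixed embedding per token (stands in for all lower layers).
embed = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]
# Trainable last layer: HIDDEN x VOCAB logit projection (the only part updated).
W = [[0.0] * VOCAB for _ in range(HIDDEN)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def prompt_nll(prompt):
    """Average negative log-likelihood of each prompt token given the previous one."""
    total = 0.0
    for prev, nxt in zip(prompt, prompt[1:]):
        h = embed[prev]
        logits = [sum(h[i] * W[i][v] for i in range(HIDDEN)) for v in range(VOCAB)]
        total -= math.log(softmax(logits)[nxt])
    return total / (len(prompt) - 1)

def fastmem_step(prompt, lr=0.2):
    """One gradient step on the prompt NLL, touching only the last layer W."""
    grad = [[0.0] * VOCAB for _ in range(HIDDEN)]
    for prev, nxt in zip(prompt, prompt[1:]):
        h = embed[prev]
        logits = [sum(h[i] * W[i][v] for i in range(HIDDEN)) for v in range(VOCAB)]
        p = softmax(logits)
        for v in range(VOCAB):  # d(NLL)/d(logit_v) = p_v - 1[v == nxt]
            err = p[v] - (1.0 if v == nxt else 0.0)
            for i in range(HIDDEN):
                grad[i][v] += err * h[i]
    n = len(prompt) - 1
    for i in range(HIDDEN):
        for v in range(VOCAB):
            W[i][v] -= lr * grad[i][v] / n

prompt = [1, 5, 2, 5, 3]   # a "prompt" as token ids
before = prompt_nll(prompt)
for _ in range(50):        # fast memorization: a short burst of updates
    fastmem_step(prompt)
after = prompt_nll(prompt)
```

After the burst of updates, the prompt's NLL drops, i.e. the (toy) model has memorized its context; FastMem then decodes with this briefly adapted last layer while the rest of the network stays untouched.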
Related papers
- LiveMind: Low-latency Large Language Models with Simultaneous Inference [9.795240210326346]
We introduce a novel low-latency inference framework for large language models (LLMs).
By reallocating computational processes to the prompt input phase, we achieve a substantial reduction in latency.
For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
arXiv Detail & Related papers (2024-06-20T13:52:30Z)
- On the Worst Prompt Performance of Large Language Models [93.13542053835542]
Performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts.
We introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries.
Experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance.
arXiv Detail & Related papers (2024-06-08T13:40:38Z)
- SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM [24.65339628772433]
SUBLLM is an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules.
During training, SUBLLM increases speed by 26% and cuts memory usage by 10GB per GPU.
In inference, it boosts speeds by up to 37% and reduces memory by 1GB per GPU.
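The abstract only names the three modules, so the sketch below is a hypothetical toy, not SUBLLM's architecture: the core blocks run on a subsampled token sequence, an upsampling stage restores the original length, and a bypass (residual) path merges the full-resolution stream back in.

```python
# Toy sketch of a subsample -> process -> upsample -> bypass pipeline
# (illustrative only; SUBLLM's actual modules are learned, not fixed).

def block(tokens, scale=1.1):
    """Stand-in for a transformer block: just rescales each token vector."""
    return [[x * scale for x in t] for t in tokens]

def subsample(tokens, stride=2):
    return tokens[::stride]          # keep every `stride`-th token

def upsample(tokens, n):
    out = []
    for i in range(n):               # nearest-neighbor expansion back to length n
        out.append(tokens[min(i // 2, len(tokens) - 1)])
    return out

def subllm_forward(tokens):
    n = len(tokens)
    h = block(tokens)                # full-resolution block
    skip = h                         # bypass connection around the subsampled core
    h = block(subsample(h))          # core blocks see half as many tokens
    h = upsample(h, n)               # restore the original sequence length
    return [[a + b for a, b in zip(t, s)] for t, s in zip(h, skip)]

x = [[1.0], [2.0], [3.0], [4.0]]
y = subllm_forward(x)
```

The speed and memory savings reported above come from the core blocks operating on the shorter subsampled sequence, while the bypass preserves full-resolution information.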
arXiv Detail & Related papers (2024-06-03T16:43:04Z)
- Efficient Prompt Tuning by Multi-Space Projection and Prompt Fusion [9.55994486328914]
Prompt tuning is a promising method to fine-tune a pre-trained language model without retraining its large-scale parameters.
Existing methods struggle to balance accuracy and efficiency.
A longer (shorter) soft prompt generally leads to better (worse) accuracy, but at the cost of more (less) training time.
We propose an Efficient Prompt Tuning method (EPT) by multi-space projection and prompt fusion.
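For context, a minimal sketch of plain soft prompt tuning (illustrative only; EPT's multi-space projection and prompt fusion are not reproduced here): the backbone stays frozen, only k "virtual token" embeddings are trained, and k drives the accuracy/efficiency trade-off.

```python
import random

random.seed(0)
D_MODEL = 16  # assumed embedding width of the (frozen) backbone

def make_soft_prompt(k, d=D_MODEL):
    """k trainable virtual-token embeddings, randomly initialized."""
    return [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(k)]

def embed_input(soft_prompt, token_embeddings):
    """Prepend the soft prompt; the frozen backbone then runs unchanged."""
    return soft_prompt + token_embeddings

p_short, p_long = make_soft_prompt(4), make_soft_prompt(100)
trainable_short = sum(len(v) for v in p_short)  # 4 * 16 = 64 parameters
trainable_long = sum(len(v) for v in p_long)    # 100 * 16 = 1600 parameters
```

The parameter counts make the trade-off concrete: every extra virtual token adds d trainable parameters and lengthens the sequence every forward pass must process.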
arXiv Detail & Related papers (2024-05-19T06:43:12Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the computational overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
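A toy sketch of the general idea, using an assumed cosine-similarity heuristic rather than FFN-SkipLLM's actual skipping criterion: once the hidden state stops changing much between layers, later FFN blocks are skipped to save compute.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def ffn(h):
    """Stand-in FFN block: a fixed residual nonlinearity."""
    return [x + math.tanh(x) for x in h]

def forward(h, n_layers=8, tau=0.99):
    """Skip an FFN block when its input is nearly parallel to the previous one."""
    skipped, prev = 0, None
    for _ in range(n_layers):
        if prev is not None and cosine(h, prev) > tau:
            skipped += 1        # representation has saturated: reuse h as-is
            prev = h
            continue
        prev = h
        h = ffn(h)
    return h, skipped

out, n_skipped = forward([1.0, 2.0])
```

Because the decision is made per input at run time, easy inputs skip more FFN blocks than hard ones, which is what makes the strategy input-adaptive.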
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
- Online Adaptation of Language Models with a Memory of Amortized Contexts [86.91360597169563]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We propose an amortized feature extraction and memory-augmentation approach to compress and extract information from new documents.
Our experiment demonstrates the superiority of MAC in multiple aspects, including online adaptation performance, time, and memory efficiency.
arXiv Detail & Related papers (2024-03-07T08:34:57Z)
- CliqueParcel: An Approach For Batching LLM Prompts That Jointly Optimizes Efficiency And Faithfulness [13.554160815699435]
CliqueParcel is designed to improve the efficiency of large language models (LLMs) during the inference process.
CliqueParcel is tested on eight widely recognized datasets.
This work provides novel insights into inference efficiency and demonstrates promising performance.
arXiv Detail & Related papers (2024-02-17T22:37:17Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
- Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 pairs of competitive C++ programming submissions, capturing performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for finetuning, we use performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.