Focused Transformer: Contrastive Training for Context Scaling
- URL: http://arxiv.org/abs/2307.03170v2
- Date: Thu, 30 Nov 2023 17:15:34 GMT
- Title: Focused Transformer: Contrastive Training for Context Scaling
- Authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu,
  Henryk Michalewski, Piotr Miłoś
- Abstract summary: We introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning.
FoT enhances the structure of the (key, value) space, enabling an extension of the context length.
Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context.
- Score: 31.44508996359732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have an exceptional capability to incorporate new
information in a contextual manner. However, the full potential of such an
approach is often restrained due to a limitation in the effective context
length. One solution to this issue is to endow an attention layer with access
to an external memory, which consists of (key, value) pairs. Yet, as the
number of documents increases, the proportion of relevant keys to irrelevant
ones decreases, leading the model to focus more on the irrelevant keys. We
identify a significant challenge, dubbed the distraction issue, where keys
linked to different semantic values might overlap, making them hard to
distinguish. To tackle this problem, we introduce the Focused Transformer
(FoT), a technique that employs a training process inspired by contrastive
learning. This novel approach enhances the structure of the (key, value) space,
enabling an extension of the context length. Our method allows for fine-tuning
pre-existing, large-scale models to lengthen their effective context. This is
demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The
resulting models, which we name LongLLaMA, exhibit advancements in tasks
requiring a long context. We further illustrate that our LongLLaMA models
adeptly manage a 256k context length for passkey retrieval.
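The abstract names the mechanism only at a high level. As a rough illustration, below is a minimal PyTorch-style sketch of the crossbatch idea behind FoT: during training, an attention layer also sees (key, value) pairs borrowed from other documents in the batch, which act as contrastive negatives and push the model toward a more separable key structure. This is not the authors' implementation; the class name, the num_neg_docs parameter, and the batch-rolling trick are assumptions made for illustration.

```python
# Minimal sketch (not the paper's code) of exposing an attention layer to
# (key, value) pairs from other documents in the batch, which then serve as
# contrastive negatives that the model must learn to ignore.

import torch
import torch.nn.functional as F


class CrossBatchMemoryAttention(torch.nn.Module):
    def __init__(self, dim: int, num_neg_docs: int = 1):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.num_neg_docs = num_neg_docs  # how many other documents to mix in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); each batch element is a different document.
        b, t, d = x.shape
        q = self.q_proj(x)        # (b, t, d)
        k_local = self.k_proj(x)  # keys from the same document
        v_local = self.v_proj(x)

        # Roll the batch to borrow (key, value) pairs from other documents;
        # these unrelated keys play the role of contrastive negatives.
        neg_k = [torch.roll(k_local, shifts=i + 1, dims=0) for i in range(self.num_neg_docs)]
        neg_v = [torch.roll(v_local, shifts=i + 1, dims=0) for i in range(self.num_neg_docs)]

        k = torch.cat([k_local] + neg_k, dim=1)  # (b, t * (1 + num_neg_docs), d)
        v = torch.cat([v_local] + neg_v, dim=1)

        # Plain scaled dot-product attention over the enlarged (key, value) set;
        # the model must learn to down-weight keys from unrelated documents.
        scores = q @ k.transpose(-2, -1) / d ** 0.5
        attn = F.softmax(scores, dim=-1)
        return attn @ v
```

The sketch only illustrates the training-time exposure to negative keys; the paper's full method also covers how such layers read from an external (key, value) memory at inference time, which is omitted here.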
Related papers
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries.
These massive values play a critical role in interpreting contextual knowledge.
We trace the emergence of massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE).
arXiv Detail & Related papers (2025-02-03T17:47:03Z)
- Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs (a minimal sketch of this centroid lookup appears after this list).
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling [42.67141329779589]
Grouped Cross Attention can generalize to 1000 times the pre-training context length.
Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths.
arXiv Detail & Related papers (2024-10-02T15:18:34Z)
- Writing in the Margins: Better Inference Pattern for Long Context Retrieval [0.9404560827144429]
Writing in the Margins (WiM) is an inference pattern designed to optimize the handling of long input sequences in retrieval-oriented tasks.
We show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing.
arXiv Detail & Related papers (2024-08-27T09:34:38Z)
- FocusLLM: Precise Understanding of Long Context by Dynamic Condensing [16.642675785000176]
FocusLLM is a framework designed to extend the fixed context length of any decoder-only LLM.
It employs a dynamic condensing process to distill crucial information from each chunk.
Ultimately, through a novel parallel decoding mechanism, FocusLLM can integrate the extracted information into its local context.
arXiv Detail & Related papers (2024-08-21T16:11:59Z)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
- Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z)
- Training With "Paraphrasing the Original Text" Improves Long-Context Performance [19.48556587305737]
As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs.
We propose a novel approach to design training data for long-context tasks, aiming at augmenting LLMs' proficiency in extracting key information from long context.
Experimenting on the LongBench and NaturalQuestions Multi-document-QA datasets with models from the Llama and Qwen series, our method achieves improvements of up to 8.48% and 4.48% in average scores.
arXiv Detail & Related papers (2023-12-18T13:40:16Z)
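As referenced in the Squeezed Attention entry above, the following is a minimal sketch of the centroid-lookup idea that entry describes: cluster the fixed context's keys offline with K-means, score an incoming query against the cluster centroids, and run exact attention only over keys from the best-matching clusters. It is not the authors' implementation; the function names and the top_clusters parameter are illustrative assumptions.

```python
# Sketch of centroid-based key selection over a fixed (cached) context,
# followed by exact attention restricted to the selected keys.

import numpy as np
from sklearn.cluster import KMeans


def build_key_index(fixed_keys: np.ndarray, n_clusters: int = 8) -> KMeans:
    # Offline step: group the fixed context's keys by semantic similarity.
    return KMeans(n_clusters=n_clusters, n_init=10).fit(fixed_keys)


def squeezed_attention(query: np.ndarray, fixed_keys: np.ndarray,
                       fixed_values: np.ndarray, index: KMeans,
                       top_clusters: int = 2) -> np.ndarray:
    # Online step 1: score the query against the cluster centroids only.
    centroid_scores = index.cluster_centers_ @ query
    keep = np.argsort(centroid_scores)[-top_clusters:]

    # Online step 2: exact attention over keys in the kept clusters only.
    mask = np.isin(index.labels_, keep)
    k, v = fixed_keys[mask], fixed_values[mask]
    scores = k @ query / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v


# Example with random data: 256 cached (key, value) pairs of dimension 64.
keys = np.random.randn(256, 64).astype(np.float32)
values = np.random.randn(256, 64).astype(np.float32)
index = build_key_index(keys)
out = squeezed_attention(np.random.randn(64).astype(np.float32), keys, values, index)
```

The saving comes from the first scoring pass touching only a handful of centroids rather than every cached key; exact attention is then restricted to the keys of the selected clusters.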