Focused Transformer: Contrastive Training for Context Scaling
- URL: http://arxiv.org/abs/2307.03170v2
- Date: Thu, 30 Nov 2023 17:15:34 GMT
- Title: Focused Transformer: Contrastive Training for Context Scaling
- Authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu,
  Henryk Michalewski, Piotr Miłoś
- Abstract summary: We introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning.
FoT enhances the structure of the (key, value) space, enabling an extension of the context length.
Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context.
- Score: 31.44508996359732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have an exceptional capability to incorporate new
information in a contextual manner. However, the full potential of such an
approach is often restrained due to a limitation in the effective context
length. One solution to this issue is to endow an attention layer with access
to an external memory, which consists of (key, value) pairs. Yet, as the
number of documents increases, the proportion of relevant keys to irrelevant
ones decreases, leading the model to focus more on the irrelevant keys. We
identify a significant challenge, dubbed the distraction issue, where keys
linked to different semantic values might overlap, making them hard to
distinguish. To tackle this problem, we introduce the Focused Transformer
(FoT), a technique that employs a training process inspired by contrastive
learning. This novel approach enhances the structure of the (key, value) space,
enabling an extension of the context length. Our method allows for fine-tuning
pre-existing, large-scale models to lengthen their effective context. This is
demonstrated by our fine-tuning of 3B and 7B OpenLLaMA checkpoints. The
resulting models, which we name LongLLaMA, exhibit advancements in tasks
requiring a long context. We further illustrate that our LongLLaMA models
adeptly manage a 256k context length for passkey retrieval.
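The abstract names the mechanism only at a high level. As a rough illustration, below is a minimal PyTorch-style sketch of the crossbatch idea behind FoT: during training, an attention layer also sees (key, value) pairs borrowed from other documents in the batch, which act as contrastive negatives and push the model toward a more separable key structure. This is not the authors' implementation; the class name, the num_neg_docs parameter, and the batch-rolling trick are assumptions made for illustration.

```python
# Minimal sketch (not the paper's code) of exposing an attention layer to
# (key, value) pairs from other documents in the batch, which then serve as
# contrastive negatives that the model must learn to ignore.

import torch
import torch.nn.functional as F


class CrossBatchMemoryAttention(torch.nn.Module):
    def __init__(self, dim: int, num_neg_docs: int = 1):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.num_neg_docs = num_neg_docs  # how many other documents to mix in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); each batch element is a different document.
        b, t, d = x.shape
        q = self.q_proj(x)        # (b, t, d)
        k_local = self.k_proj(x)  # keys from the same document
        v_local = self.v_proj(x)

        # Roll the batch to borrow (key, value) pairs from other documents;
        # these unrelated keys play the role of contrastive negatives.
        neg_k = [torch.roll(k_local, shifts=i + 1, dims=0) for i in range(self.num_neg_docs)]
        neg_v = [torch.roll(v_local, shifts=i + 1, dims=0) for i in range(self.num_neg_docs)]

        k = torch.cat([k_local] + neg_k, dim=1)  # (b, t * (1 + num_neg_docs), d)
        v = torch.cat([v_local] + neg_v, dim=1)

        # Plain scaled dot-product attention over the enlarged (key, value) set;
        # the model must learn to down-weight keys from unrelated documents.
        scores = q @ k.transpose(-2, -1) / d ** 0.5
        attn = F.softmax(scores, dim=-1)
        return attn @ v
```

The sketch only illustrates the training-time exposure to negative keys; the paper's full method also covers how such layers read from an external (key, value) memory at inference time, which is omitted here.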
Related papers
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries.
These massive values play a critical role in interpreting contextual knowledge.
We trace the emergence of massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE).
arXiv Detail & Related papers (2025-02-03T17:47:03Z)
- Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs (a minimal sketch of this centroid lookup appears after this list).
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling [42.67141329779589]
Grouped Cross Attention can generalize to 1000 times the pre-training context length.
Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths.
arXiv Detail & Related papers (2024-10-02T15:18:34Z)
- Writing in the Margins: Better Inference Pattern for Long Context Retrieval [0.9404560827144429]
Writing in the Margins (WiM) is an inference pattern designed to optimize the handling of long input sequences in retrieval-oriented tasks.
We show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing.
arXiv Detail & Related papers (2024-08-27T09:34:38Z)
- FocusLLM: Precise Understanding of Long Context by Dynamic Condensing [16.642675785000176]
FocusLLM is a framework designed to extend the fixed context length of any decoder-only LLM.
It employs a dynamic condensing process to distill crucial information from each chunk.
Ultimately, through a novel parallel decoding mechanism, FocusLLM can integrate the extracted information into its local context.
arXiv Detail & Related papers (2024-08-21T16:11:59Z)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
- Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z)
- Training With "Paraphrasing the Original Text" Improves Long-Context Performance [19.48556587305737]
As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs.
We propose a novel approach to design training data for long-context tasks, aiming at augmenting LLMs' proficiency in extracting key information from long context.
Experimenting on the LongBench and NaturalQuestions Multi-document-QA datasets with models from the Llama and Qwen series, our method achieves improvements of up to 8.48% and 4.48% in average scores.
arXiv Detail & Related papers (2023-12-18T13:40:16Z)
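As referenced in the Squeezed Attention entry above, the following is a minimal sketch of the centroid-lookup idea that entry describes: cluster the fixed context's keys offline with K-means, score an incoming query against the cluster centroids, and run exact attention only over keys from the best-matching clusters. It is not the authors' implementation; the function names and the top_clusters parameter are illustrative assumptions.

```python
# Sketch of centroid-based key selection over a fixed (cached) context,
# followed by exact attention restricted to the selected keys.

import numpy as np
from sklearn.cluster import KMeans


def build_key_index(fixed_keys: np.ndarray, n_clusters: int = 8) -> KMeans:
    # Offline step: group the fixed context's keys by semantic similarity.
    return KMeans(n_clusters=n_clusters, n_init=10).fit(fixed_keys)


def squeezed_attention(query: np.ndarray, fixed_keys: np.ndarray,
                       fixed_values: np.ndarray, index: KMeans,
                       top_clusters: int = 2) -> np.ndarray:
    # Online step 1: score the query against the cluster centroids only.
    centroid_scores = index.cluster_centers_ @ query
    keep = np.argsort(centroid_scores)[-top_clusters:]

    # Online step 2: exact attention over keys in the kept clusters only.
    mask = np.isin(index.labels_, keep)
    k, v = fixed_keys[mask], fixed_values[mask]
    scores = k @ query / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v


# Example with random data: 256 cached (key, value) pairs of dimension 64.
keys = np.random.randn(256, 64).astype(np.float32)
values = np.random.randn(256, 64).astype(np.float32)
index = build_key_index(keys)
out = squeezed_attention(np.random.randn(64).astype(np.float32), keys, values, index)
```

The saving comes from the first scoring pass touching only a handful of centroids rather than every cached key; exact attention is then restricted to the keys of the selected clusters.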