Focused Transformer: Contrastive Training for Context Scaling
- URL: http://arxiv.org/abs/2307.03170v2
- Date: Thu, 30 Nov 2023 17:15:34 GMT
- Title: Focused Transformer: Contrastive Training for Context Scaling
- Authors: Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu,
  Henryk Michalewski, Piotr Miłoś
- Abstract summary: We introduce the Focused Transformer (FoT), a technique that employs a training process inspired by contrastive learning.
FoT enhances the structure of the (key, value) space, enabling an extension of the context length.
Our method allows for fine-tuning pre-existing, large-scale models to lengthen their effective context.
- Score: 31.44508996359732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have an exceptional capability to incorporate new
information in a contextual manner. However, the full potential of such an
approach is often restrained due to a limitation in the effective context
length. One solution to this issue is to endow an attention layer with access
to an external memory, which comprises (key, value) pairs. Yet, as the
number of documents increases, the proportion of relevant keys to irrelevant
ones decreases, leading the model to focus more on the irrelevant keys. We
identify a significant challenge, dubbed the distraction issue, where keys
linked to different semantic values might overlap, making them hard to
distinguish. To tackle this problem, we introduce the Focused Transformer
(FoT), a technique that employs a training process inspired by contrastive
learning. This novel approach enhances the structure of the (key, value) space,
enabling an extension of the context length. Our method allows for fine-tuning
pre-existing, large-scale models to lengthen their effective context. This is
demonstrated by our fine-tuning of $3B$ and $7B$ OpenLLaMA checkpoints. The
resulting models, which we name LongLLaMA, exhibit advancements in tasks
requiring a long context. We further illustrate that our LongLLaMA models
adeptly manage a $256k$ context length for passkey retrieval.
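The following is a minimal sketch (assumed shapes and names, not the authors' released code) of the setup the abstract describes: an attention head whose keys and values come from the local context concatenated with an external (key, value) memory. As the memory fills with unrelated entries, the softmax mass on the local keys shrinks, which is the distraction issue that FoT's contrastive-style training is meant to mitigate.

```python
# Minimal sketch: attention over local context plus an external (key, value) memory.
# Illustrative only; not the FoT training procedure itself.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_augmented_attention(q, local_k, local_v, mem_k, mem_v):
    """q: (d,); local_k/mem_k: (n, d); local_v/mem_v: (n, d_v)."""
    k = np.concatenate([local_k, mem_k], axis=0)
    v = np.concatenate([local_v, mem_v], axis=0)
    scores = k @ q / np.sqrt(q.shape[0])
    w = softmax(scores)
    return w @ v, w

rng = np.random.default_rng(0)
d, d_v = 64, 64
q = rng.normal(size=d)
local_k, local_v = rng.normal(size=(16, d)), rng.normal(size=(16, d_v))
# Grow the external memory with irrelevant entries and watch attention mass drift away
# from the local keys -- the "distraction issue" described in the abstract.
for n_mem in (64, 1024, 16384):
    mem_k, mem_v = rng.normal(size=(n_mem, d)), rng.normal(size=(n_mem, d_v))
    _, w = memory_augmented_attention(q, local_k, local_v, mem_k, mem_v)
    print(n_mem, "share of attention on local keys:", w[:16].sum().round(3))
```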
Related papers
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [49.43759617227999]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates 10+ state-of-the-art approaches across seven categories of long-context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z) - CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
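As a rough illustration of the eviction idea summarized above (a toy sketch with assumed names and shapes, not the CItruS implementation), cached (key, value) pairs can be scored by how much attention instruction-token queries pay to them, and only the highest-scoring entries kept under a fixed budget:

```python
# Toy illustration of instruction-aware cache eviction (assumed shapes/names).
import numpy as np

def evict_cache(cache_k, cache_v, instr_q, budget):
    """cache_k/cache_v: (n, d)/(n, d_v); instr_q: (m, d) instruction-token queries."""
    scores = instr_q @ cache_k.T / np.sqrt(cache_k.shape[1])   # (m, n)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)                        # softmax over cached positions
    importance = w.mean(axis=0)                                 # average over instruction tokens
    keep = np.sort(np.argsort(importance)[-budget:])            # keep top entries, preserve order
    return cache_k[keep], cache_v[keep]

rng = np.random.default_rng(1)
k, v = rng.normal(size=(4096, 64)), rng.normal(size=(4096, 64))
instr_q = rng.normal(size=(8, 64))
k_small, v_small = evict_cache(k, v, instr_q, budget=512)
print(k_small.shape)  # (512, 64)
```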
arXiv Detail & Related papers (2024-06-17T18:34:58Z) - TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions.
By expanding its capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability.
arXiv Detail & Related papers (2024-03-07T13:16:24Z) - Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training.
By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
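A simplified sketch of the chunk-based decomposition described above (an illustration of the general idea, not the exact Dual Chunk Attention formulation): attention is computed chunk by chunk, and the relative distance used for a toy positional penalty is clamped so it never exceeds the range seen during pretraining.

```python
# Simplified chunk-wise attention with bounded relative distances (illustrative only).
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def chunked_attention(q, q_pos, keys, values, chunk_size, max_trained_dist, decay=0.01):
    """Score each key chunk separately; the relative distance fed to the (toy)
    positional bias is clamped so it stays within the pretrained range."""
    scores = []
    for start in range(0, len(keys), chunk_size):
        k_chunk = keys[start:start + chunk_size]
        k_pos = np.arange(start, start + len(k_chunk))
        dist = np.clip(q_pos - k_pos, 0, max_trained_dist)      # bounded distances
        s = k_chunk @ q / np.sqrt(q.shape[0]) - decay * dist    # toy distance penalty
        scores.append(s)
    w = softmax(np.concatenate(scores))
    return w @ values

rng = np.random.default_rng(3)
keys, values = rng.normal(size=(8192, 64)), rng.normal(size=(8192, 64))
q = rng.normal(size=64)
out = chunked_attention(q, q_pos=8191, keys=keys, values=values,
                        chunk_size=512, max_trained_dist=4096)
print(out.shape)  # (64,)
```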
arXiv Detail & Related papers (2024-02-27T12:39:23Z) - Training With "Paraphrasing the Original Text" Improves Long-Context Performance [0.0]
As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs.
This paper identifies the root of their long-context shortcomings as a deficiency in retrieval capabilities, exacerbated by the sparsity of key information in long contexts.
We introduce a novel approach called "Paraphrasing the Original Text", aimed at augmenting LLMs' proficiency in extracting information from long contexts.
arXiv Detail & Related papers (2023-12-18T13:40:16Z) - Jaeger: A Concatenation-Based Multi-Transformer VQA Model [0.13654846342364307]
Document-based Visual Question Answering poses a challenging task that spans linguistic sense disambiguation and fine-grained multimodal retrieval.
We propose Jaeger, a concatenation-based multi-transformer VQA model.
Our approach has the potential to amplify the performance of these models through concatenation.
arXiv Detail & Related papers (2023-10-11T00:14:40Z) - Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
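A minimal sketch of the discrete key-value bottleneck idea summarized above (illustrative shapes and names, not the paper's code): an encoder feature is snapped to its nearest key in a codebook, and task-specific tuning only touches the value paired with that key, which limits how much of the encoder must change per task.

```python
# Minimal discrete key-value bottleneck sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
num_codes, d_key, d_value = 256, 32, 16
keys = rng.normal(size=(num_codes, d_key))      # discrete keys (kept frozen)
values = np.zeros((num_codes, d_value))         # learnable per-task values

def bottleneck(encoder_feature):
    """Quantize the feature to its nearest key and return that key's value."""
    dists = ((keys - encoder_feature) ** 2).sum(axis=1)
    code = int(dists.argmin())
    return code, values[code]

feat = rng.normal(size=d_key)                   # stand-in for a pre-trained encoder output
code, v = bottleneck(feat)
values[code] += 0.1 * rng.normal(size=d_value)  # task tuning updates one value vector only
print(code, v.shape)
```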
arXiv Detail & Related papers (2022-07-22T17:52:30Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z) - Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical
Encoder for Long-Form Document Matching [28.190001111358438]
We propose the Siamese Multi-depth Transformer-based Hierarchical Encoder (SMITH) for long-form document matching.
Our model contains several innovations to adapt self-attention models for longer text input.
We will open-source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
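As a rough illustration of the hierarchical, siamese setup named in the title (a sketch under assumed shapes; mean pooling stands in for SMITH's block-level and document-level transformers), each document is pooled block by block into a single vector and two documents are compared by cosine similarity:

```python
# Rough hierarchical siamese matching sketch (illustrative only).
import numpy as np

def encode_document(token_vectors, block_size=64):
    """token_vectors: (n_tokens, d). Pool tokens into blocks, then blocks into a doc vector."""
    blocks = [token_vectors[i:i + block_size].mean(axis=0)
              for i in range(0, len(token_vectors), block_size)]
    return np.stack(blocks).mean(axis=0)

def match_score(doc_a, doc_b):
    a, b = encode_document(doc_a), encode_document(doc_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(4)
doc_a = rng.normal(size=(2048, 128))   # stand-in token embeddings for document A
doc_b = rng.normal(size=(1536, 128))   # stand-in token embeddings for document B
print(round(match_score(doc_a, doc_b), 4))
```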
arXiv Detail & Related papers (2020-04-26T07:04:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.