LongT5: Efficient Text-To-Text Transformer for Long Sequences
- URL: http://arxiv.org/abs/2112.07916v1
- Date: Wed, 15 Dec 2021 06:35:29 GMT
- Title: LongT5: Efficient Text-To-Text Transformer for Long Sequences
- Authors: Mandy Guo, Joshua Ainslie, David Uthus, Santiago Ontanon, Jianmo Ni,
Yun-Hsuan Sung, Yinfei Yang
- Abstract summary: We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
- Score: 8.743996838160825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has shown that either (1) increasing the input length or (2)
increasing model size can improve the performance of Transformer-based neural
models. In this paper, we present a new model, called LongT5, with which we
explore the effects of scaling both the input length and model size at the same
time. Specifically, we integrated attention ideas from long-input transformers
(ETC), and adopted pre-training strategies from summarization pre-training
(PEGASUS) into the scalable T5 architecture. The result is a new attention
mechanism we call {\em Transient Global} (TGlobal), which mimics ETC's
local/global attention mechanism, but without requiring additional side-inputs.
We are able to achieve state-of-the-art results on several summarization tasks
and outperform the original T5 models on question answering tasks.
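Below is a rough, single-head sketch of the Transient Global pattern described in the abstract: each token attends to a fixed local window plus "global" summary tokens computed on the fly by mean-pooling fixed-size input blocks, so no extra side inputs are needed. The block size, local radius, mean pooling, and function names are illustrative assumptions; LongT5's actual implementation adds multi-head projections, relative position biases, and normalization not shown here.

```python
# Hedged sketch of a Transient Global (TGlobal)-style attention pattern.
# Block size, local radius, and mean pooling are illustrative assumptions,
# not LongT5's exact configuration.
import numpy as np

def tglobal_attention(x, local_radius=4, block_size=4):
    """x: (seq_len, d_model). Single-head, unbatched for clarity."""
    seq_len, d = x.shape
    # Transient global tokens: one summary vector per fixed-size block,
    # computed on the fly from the inputs themselves.
    n_blocks = int(np.ceil(seq_len / block_size))
    pad = n_blocks * block_size - seq_len
    xp = np.pad(x, ((0, pad), (0, 0)))
    globals_ = xp.reshape(n_blocks, block_size, d).mean(axis=1)  # (n_blocks, d)

    # Each query attends to (a) tokens within a local window and
    # (b) all transient global tokens.
    keys = np.concatenate([x, globals_], axis=0)      # (seq_len + n_blocks, d)
    scores = x @ keys.T / np.sqrt(d)                  # (seq_len, seq_len + n_blocks)

    # Mask out local keys outside the window; global keys stay visible.
    idx = np.arange(seq_len)
    local_mask = np.abs(idx[:, None] - idx[None, :]) <= local_radius
    mask = np.concatenate(
        [local_mask, np.ones((seq_len, n_blocks), dtype=bool)], axis=1)
    scores = np.where(mask, scores, -1e9)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys                             # (seq_len, d)

# Example: 16 tokens, 8-dim embeddings.
out = tglobal_attention(np.random.randn(16, 8))
print(out.shape)  # (16, 8)
```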
Related papers
- Weighted Grouped Query Attention in Transformers [0.0]
We propose a variation of Grouped-Query Attention termed Weighted Grouped-Query Attention (WGQA).
We introduce new learnable parameters for each key and value head in the T5 decoder attention blocks, enabling the model to take a weighted average during finetuning.
Our model achieves an average improvement of 0.53% over GQA and converges to traditional multi-head attention (MHA) with no additional overhead during inference.
arXiv Detail & Related papers (2024-07-15T16:07:13Z)
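As a concrete illustration of the weighted-average idea in the WGQA entry above, the sketch below learns one scalar weight per key/value head and uses it to aggregate heads into groups, instead of the uniform mean pooling typically used when converting multi-head attention to GQA. The module name, shapes, and initialization are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of weighted key/value-head grouping: learn one scalar weight
# per head and take a weighted (rather than uniform) average within each group.
import torch
import torch.nn as nn

class WeightedKVGrouping(nn.Module):
    def __init__(self, num_heads: int, num_groups: int):
        super().__init__()
        assert num_heads % num_groups == 0
        self.heads_per_group = num_heads // num_groups
        self.num_groups = num_groups
        # One learnable weight per original head, initialized to uniform
        # averaging so training starts from plain GQA-style behaviour.
        self.head_weights = nn.Parameter(
            torch.full((num_heads,), 1.0 / self.heads_per_group))

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        # kv: (batch, num_heads, seq, head_dim) -> (batch, num_groups, seq, head_dim)
        b, h, s, d = kv.shape
        kv = kv.view(b, self.num_groups, self.heads_per_group, s, d)
        w = self.head_weights.view(1, self.num_groups, self.heads_per_group, 1, 1)
        return (kv * w).sum(dim=2)

# Example: collapse 8 key heads into 2 groups.
group_keys = WeightedKVGrouping(num_heads=8, num_groups=2)
k = torch.randn(1, 8, 16, 64)
print(group_keys(k).shape)  # torch.Size([1, 2, 16, 64])
```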
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps mitigate the lack of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation [61.305218287797025]
An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning.
We find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns.
We propose two attention alignment strategies via temperature scaling to alleviate the degradation these models suffer beyond the training length.
arXiv Detail & Related papers (2023-11-01T17:43:35Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose FIRE, a novel functional relative position encoding with progressive interpolation, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
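The sketch below shows the general shape of a functional relative-position bias in the spirit of the FIRE entry above: a small learned MLP maps a log-scaled, progressively normalized query-key distance to an additive attention bias. The specific transform, normalization, and network used here are assumptions for illustration, not the exact published parameterization.

```python
# Hedged sketch of a functional relative-position bias: a tiny MLP maps a
# normalized query-key distance to an additive attention bias. The log
# transform, query-position normalization, and MLP shape are assumptions.
import torch
import torch.nn as nn

class FunctionalRelativeBias(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, seq_len: int) -> torch.Tensor:
        i = torch.arange(seq_len).view(-1, 1).float()   # query positions
        j = torch.arange(seq_len).view(1, -1).float()   # key positions
        dist = (i - j).clamp(min=0.0)                   # causal distance
        # Normalize distances by the query position so the MLP input stays
        # in a bounded range as the context grows.
        norm = torch.log1p(dist) / torch.log1p(i.clamp(min=1.0))
        bias = self.mlp(norm.unsqueeze(-1)).squeeze(-1)  # (seq_len, seq_len)
        # Mask future positions for causal attention.
        return bias.masked_fill(j > i, float("-inf"))

# Example: bias matrix to add to attention logits for a 10-token prefix.
bias = FunctionalRelativeBias()(10)
print(bias.shape)  # torch.Size([10, 10])
```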
- Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers [98.30298332661323]
This paper explores the effectiveness of model-generated signals in improving zero-shot generalization of text-to-text Transformers such as T5.
We develop a new model, METRO-T0, which is pretrained using redesigned ELECTRA-style pretraining strategies and then prompt-finetuned on a mixture of NLP tasks.
Our analysis of the model's neural activation and parameter sensitivity reveals that the effectiveness of METRO-T0 stems from a more balanced contribution of parameters and better utilization of their capacity.
arXiv Detail & Related papers (2023-05-21T21:06:23Z)
- Investigating Efficiently Extending Transformers for Long Input Summarization [37.622021824791254]
We investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization.
We find that a staggered, block-local Transformer with global tokens strikes a good balance of performance and efficiency.
We introduce PEGASUS-X, an extension of the PEGASUS model with additional long-input pretraining to handle inputs of up to 16K tokens.
arXiv Detail & Related papers (2022-08-08T18:10:58Z)
- Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers [57.931830650323]
This paper presents scaling insights from pretraining and finetuning Transformers.
We show that, aside from model size alone, model shape matters for downstream fine-tuning.
We present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while using fewer parameters and training faster.
arXiv Detail & Related papers (2021-09-22T12:29:15Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Longformer: The Long-Document Transformer [40.18988262517733]
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length.
We introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer.
Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task-motivated global attention.
arXiv Detail & Related papers (2020-04-10T17:54:09Z)
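To make the combination of local windowed attention and task-motivated global attention in the Longformer entry concrete, here is a minimal sketch that builds a boolean attention mask from a sliding window plus a few designated global positions. The window size and the choice of global positions are illustrative assumptions; efficient implementations use banded or blocked kernels rather than dense masks.

```python
# Hedged sketch of a Longformer-style attention mask: a sliding local window
# plus designated global positions that attend to, and are attended by,
# every token.
import numpy as np

def longformer_style_mask(seq_len, window=2, global_positions=(0,)):
    idx = np.arange(seq_len)
    # Local sliding window: each token sees neighbours within `window`.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    # Global tokens (e.g. a [CLS] or question token) see and are seen by all.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# Example: 8 tokens, window of 2, position 0 global.
print(longformer_style_mask(8).astype(int))
```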
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.