CoLT5: Faster Long-Range Transformers with Conditional Computation
- URL: http://arxiv.org/abs/2303.09752v3
- Date: Tue, 24 Oct 2023 00:51:49 GMT
- Title: CoLT5: Faster Long-Range Transformers with Conditional Computation
- Authors: Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón,
Siddhartha Brahma, Yury Zemlyanskiy, David Uthus, Mandy Guo, James Lee-Thorp,
Yi Tay, Yun-Hsuan Sung, Sumit Sanghai
- Abstract summary: We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference.
CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.
- Score: 65.83586041097763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many natural language processing tasks benefit from long inputs, but
processing long documents with Transformers is expensive -- not only due to
quadratic attention complexity but also from applying feedforward and
projection layers to every token. However, not all tokens are equally
important, especially for longer documents. We propose CoLT5, a long-input
Transformer model that builds on this intuition by employing conditional
computation, devoting more resources to important tokens in both feedforward
and attention layers. We show that CoLT5 achieves stronger performance than
LongT5 with much faster training and inference, achieving SOTA on the
long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably
make use of extremely long inputs, showing strong gains up to 64k input length.
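To make the conditional-computation idea concrete, the following is a minimal PyTorch sketch, not the official CoLT5 implementation: every token goes through a cheap feedforward branch, a learned router scores token importance, and only the top-k tokens are additionally processed by a heavier branch (the dimensions, top-k size, and sigmoid gating here are illustrative assumptions).

```python
import torch
import torch.nn as nn

class ConditionalFFN(nn.Module):
    """Light branch for every token; heavy branch only for routed tokens."""
    def __init__(self, d_model, d_light, d_heavy, k):
        super().__init__()
        self.light = nn.Sequential(nn.Linear(d_model, d_light), nn.ReLU(),
                                   nn.Linear(d_light, d_model))
        self.heavy = nn.Sequential(nn.Linear(d_model, d_heavy), nn.ReLU(),
                                   nn.Linear(d_heavy, d_model))
        self.router = nn.Linear(d_model, 1)  # scores how important each token is
        self.k = k

    def forward(self, x):                                # x: (batch, seq, d_model)
        scores = self.router(x).squeeze(-1)              # (batch, seq)
        out = self.light(x)                              # cheap path applied to all tokens
        top = torch.topk(scores, self.k, dim=-1)         # pick the k highest-scoring tokens
        idx = top.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, idx)               # (batch, k, d_model)
        gate = torch.sigmoid(top.values).unsqueeze(-1)   # soft gate keeps routing differentiable
        return out.scatter_add(1, idx, self.heavy(selected) * gate)

layer = ConditionalFFN(d_model=512, d_light=1024, d_heavy=4096, k=128)
print(layer(torch.randn(2, 4096, 512)).shape)            # torch.Size([2, 4096, 512])
```

CoLT5 applies the same light/heavy split to attention as well (light local attention for all tokens, heavier attention for routed tokens); the sketch covers only the feedforward case.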
Related papers
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models [50.46453950887946]
This work introduces MrT5 (MergeT5), a more efficient variant of ByT5.
MrT5 integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length.
When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages.
arXiv Detail & Related papers (2024-10-28T06:14:12Z)
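A minimal sketch of the dynamic-shortening idea described above (not MrT5's actual gate, training objective, or layer placement, all of which are simplified here): a small learned gate scores each encoder state after an early layer, and positions whose deletion probability exceeds a threshold are dropped so later layers see a shorter sequence.

```python
import torch
import torch.nn as nn

class DeleteGate(nn.Module):
    """Drops positions with a high predicted deletion probability."""
    def __init__(self, d_model, threshold=0.5):
        super().__init__()
        self.score = nn.Linear(d_model, 1)
        self.threshold = threshold

    def forward(self, hidden):                      # hidden: (seq, d_model), one sequence
        p_delete = torch.sigmoid(self.score(hidden)).squeeze(-1)
        keep = p_delete < self.threshold            # boolean mask of surviving positions
        return hidden[keep], keep                   # shortened input for the remaining layers

gate = DeleteGate(d_model=512)
h = torch.randn(1024, 512)                          # e.g. 1024 byte-level encoder states
shorter, kept = gate(h)
print(h.shape[0], "->", shorter.shape[0], "positions kept")
```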
- Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation [61.305218287797025]
An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning.
We find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns.
We propose two attention alignment strategies via temperature scaling to alleviate the degradation that arises beyond the training length.
arXiv Detail & Related papers (2023-11-01T17:43:35Z)
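The role of temperature scaling can be seen in a short sketch (generic dot-product attention; the paper's two alignment strategies decide how the temperature is chosen, which is not reproduced here):

```python
import torch
import torch.nn.functional as F

def attention_with_temperature(q, k, v, temperature=1.0):
    """Dot-product attention with an extra temperature on the logits.

    temperature < 1 sharpens the attention distribution, temperature > 1
    flattens it; tying the value to sequence length is the idea being
    illustrated, not the paper's exact schedule.
    """
    logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(logits / temperature, dim=-1) @ v

q = k = v = torch.randn(1, 8, 2048, 64)                    # (batch, heads, seq, head_dim)
out = attention_with_temperature(q, k, v, temperature=0.8) # sharper weights for a longer input
print(out.shape)
```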
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model with 1.3B parameters achieves performance competitive with vanilla transformers that use full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z)
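A rough illustration of mixed attention spans (the window size and which layers get full attention are assumptions, not the paper's configuration): most layers use a local sliding-window mask and only a few layers attend over the whole sequence.

```python
import torch

def mixed_span_masks(num_layers, seq_len, window, full_layers):
    """Per-layer boolean attention masks; True means the position may be attended.

    Layers in `full_layers` use full attention; every other layer is restricted
    to a +/- `window` sliding window (values below are illustrative).
    """
    i = torch.arange(seq_len)
    local = (i[:, None] - i[None, :]).abs() <= window
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return [full if layer in full_layers else local for layer in range(num_layers)]

masks = mixed_span_masks(num_layers=24, seq_len=4096, window=256, full_layers={11, 23})
print(sum(bool(m.all()) for m in masks), "of", len(masks), "layers use full attention")
```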
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
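The functional-bias idea can be sketched as follows (a simplification: FIRE's exact transform, thresholding, and normalization are in the paper; the log scaling and query-position normalization below only follow its spirit): a tiny MLP maps a normalized relative distance to an additive attention bias, so the same learned function applies at lengths never seen in training.

```python
import torch
import torch.nn as nn

class FunctionalRelativeBias(nn.Module):
    """Maps a normalized relative distance to an attention bias with a small MLP."""
    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, seq_len):
        i = torch.arange(1, seq_len + 1).float()
        rel = (i[:, None] - i[None, :]).clamp(min=0)            # causal relative distance i - j
        norm = torch.log1p(rel) / torch.log1p(i)[:, None]       # progressively normalized by position
        return self.mlp(norm.unsqueeze(-1)).squeeze(-1)         # (seq, seq) additive bias

bias = FunctionalRelativeBias()(seq_len=128)
print(bias.shape)   # torch.Size([128, 128]); added to attention logits before the softmax
```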
- LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
- Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than 3x efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z)
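A generic sketch of the prioritize-and-compress idea, not the Vcc algorithm itself (how important tokens are chosen and how the rest are compressed are assumptions here): a small set of designated important tokens is kept exactly, while the remaining tokens are mean-pooled in chunks, so each layer operates on a far shorter sequence.

```python
import torch

def compress_sequence(x, important_idx, chunk=16):
    """Keep designated important tokens; mean-pool the rest in fixed-size chunks.

    x: (seq, d_model) layer input. The selection of important tokens and the
    pooling scheme are illustrative assumptions.
    """
    mask = torch.ones(x.size(0), dtype=torch.bool)
    mask[important_idx] = False
    rest = x[mask]
    usable = (rest.size(0) // chunk) * chunk               # drop the ragged tail for simplicity
    pooled = rest[:usable].view(-1, chunk, x.size(1)).mean(dim=1)
    return torch.cat([x[important_idx], pooled], dim=0)

x = torch.randn(16384, 512)                                # 16K-token layer input
z = compress_sequence(x, important_idx=torch.arange(64))   # e.g. 64 query tokens treated as important
print(x.shape[0], "->", z.shape[0], "positions processed by the layer")
```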
- Investigating Efficiently Extending Transformers for Long Input Summarization [37.622021824791254]
We investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization.
We find that a staggered, block-local Transformer with global tokens strikes a good balance of performance and efficiency.
We introduce PEGASUS-X, an extension of the PEGASUS model with additional long-input pretraining to handle inputs up to 16K tokens.
arXiv Detail & Related papers (2022-08-08T18:10:58Z)
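The staggered block-local pattern with global tokens can be written as an attention mask (block size, number of global tokens, and the per-layer stagger offset are illustrative assumptions, not the configuration used in the paper):

```python
import torch

def block_local_global_mask(seq_len, block, num_global, stagger=0):
    """Boolean mask in which True marks an allowed attention pair.

    Tokens attend within their (optionally staggered) block; the first
    `num_global` positions are global tokens that attend to, and are attended
    by, every position.
    """
    blk = (torch.arange(seq_len) + stagger) // block       # stagger shifts block boundaries per layer
    mask = blk[:, None] == blk[None, :]                    # block-local attention
    mask[:num_global, :] = True                            # global tokens see everything
    mask[:, :num_global] = True                            # everything sees global tokens
    return mask

mask = block_local_global_mask(seq_len=4096, block=512, num_global=32, stagger=256)
print(mask.shape, mask.float().mean().item())              # fraction of allowed attention pairs
```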
- The NLP Task Effectiveness of Long-Range Transformers [38.46467445144777]
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity.
We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets.
We find that the attention mechanisms of long-range transformers have advantages for content selection and query-guided decoding, but they come with previously unrecognized drawbacks.
arXiv Detail & Related papers (2022-02-16T04:39:35Z)
- LongT5: Efficient Text-To-Text Transformer for Long Sequences [8.743996838160825]
We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
arXiv Detail & Related papers (2021-12-15T06:35:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.