CoLT5: Faster Long-Range Transformers with Conditional Computation
- URL: http://arxiv.org/abs/2303.09752v3
- Date: Tue, 24 Oct 2023 00:51:49 GMT
- Title: CoLT5: Faster Long-Range Transformers with Conditional Computation
- Authors: Joshua Ainslie, Tao Lei, Michiel de Jong, Santiago Ontañón,
Siddhartha Brahma, Yury Zemlyanskiy, David Uthus, Mandy Guo, James Lee-Thorp,
Yi Tay, Yun-Hsuan Sung, Sumit Sanghai
- Abstract summary: We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference.
CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.
- Score: 65.83586041097763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many natural language processing tasks benefit from long inputs, but
processing long documents with Transformers is expensive -- not only due to
quadratic attention complexity but also from applying feedforward and
projection layers to every token. However, not all tokens are equally
important, especially for longer documents. We propose CoLT5, a long-input
Transformer model that builds on this intuition by employing conditional
computation, devoting more resources to important tokens in both feedforward
and attention layers. We show that CoLT5 achieves stronger performance than
LongT5 with much faster training and inference, achieving SOTA on the
long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably
make use of extremely long inputs, showing strong gains up to 64k input length.
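To make the conditional-computation idea concrete, the following is a minimal PyTorch sketch, not the official CoLT5 implementation: every token goes through a cheap feedforward branch, a learned router scores token importance, and only the top-k tokens are additionally processed by a heavier branch (the dimensions, top-k size, and sigmoid gating here are illustrative assumptions).

```python
import torch
import torch.nn as nn

class ConditionalFFN(nn.Module):
    """Light branch for every token; heavy branch only for routed tokens."""
    def __init__(self, d_model, d_light, d_heavy, k):
        super().__init__()
        self.light = nn.Sequential(nn.Linear(d_model, d_light), nn.ReLU(),
                                   nn.Linear(d_light, d_model))
        self.heavy = nn.Sequential(nn.Linear(d_model, d_heavy), nn.ReLU(),
                                   nn.Linear(d_heavy, d_model))
        self.router = nn.Linear(d_model, 1)  # scores how important each token is
        self.k = k

    def forward(self, x):                                # x: (batch, seq, d_model)
        scores = self.router(x).squeeze(-1)              # (batch, seq)
        out = self.light(x)                              # cheap path applied to all tokens
        top = torch.topk(scores, self.k, dim=-1)         # pick the k highest-scoring tokens
        idx = top.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, idx)               # (batch, k, d_model)
        gate = torch.sigmoid(top.values).unsqueeze(-1)   # soft gate keeps routing differentiable
        return out.scatter_add(1, idx, self.heavy(selected) * gate)

layer = ConditionalFFN(d_model=512, d_light=1024, d_heavy=4096, k=128)
print(layer(torch.randn(2, 4096, 512)).shape)            # torch.Size([2, 4096, 512])
```

CoLT5 applies the same light/heavy split to attention as well (light local attention for all tokens, heavier attention for routed tokens); the sketch covers only the feedforward case.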
Related papers
- MrT5: Dynamic Token Merging for Efficient Byte-level Language Models [50.46453950887946]
This work introduces MrT5 (MergeT5), a more efficient variant of ByT5.
MrT5 integrates a token deletion mechanism in its encoder to dynamically shorten the input sequence length.
When trained on English text, MrT5 demonstrates the capability to transfer its deletion feature zero-shot across several languages.
arXiv Detail & Related papers (2024-10-28T06:14:12Z)
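A minimal sketch of the dynamic-shortening idea described above (not MrT5's actual gate, training objective, or layer placement, all of which are simplified here): a small learned gate scores each encoder state after an early layer, and positions whose deletion probability exceeds a threshold are dropped so later layers see a shorter sequence.

```python
import torch
import torch.nn as nn

class DeleteGate(nn.Module):
    """Drops positions with a high predicted deletion probability."""
    def __init__(self, d_model, threshold=0.5):
        super().__init__()
        self.score = nn.Linear(d_model, 1)
        self.threshold = threshold

    def forward(self, hidden):                      # hidden: (seq, d_model), one sequence
        p_delete = torch.sigmoid(self.score(hidden)).squeeze(-1)
        keep = p_delete < self.threshold            # boolean mask of surviving positions
        return hidden[keep], keep                   # shortened input for the remaining layers

gate = DeleteGate(d_model=512)
h = torch.randn(1024, 512)                          # e.g. 1024 byte-level encoder states
shorter, kept = gate(h)
print(h.shape[0], "->", shorter.shape[0], "positions kept")
```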
- Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation [61.305218287797025]
An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning.
We find that the T5 family deserves a closer look, as its positional embeddings capture rich and flexible attention patterns.
We propose two attention alignment strategies via temperature scaling to alleviate the degradation that arises beyond the training length.
arXiv Detail & Related papers (2023-11-01T17:43:35Z)
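The role of temperature scaling can be seen in a short sketch (generic dot-product attention; the paper's two alignment strategies decide how the temperature is chosen, which is not reproduced here):

```python
import torch
import torch.nn.functional as F

def attention_with_temperature(q, k, v, temperature=1.0):
    """Dot-product attention with an extra temperature on the logits.

    temperature < 1 sharpens the attention distribution, temperature > 1
    flattens it; tying the value to sequence length is the idea being
    illustrated, not the paper's exact schedule.
    """
    logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(logits / temperature, dim=-1) @ v

q = k = v = torch.randn(1, 8, 2048, 64)                    # (batch, heads, seq, head_dim)
out = attention_with_temperature(q, k, v, temperature=0.8) # sharper weights for a longer input
print(out.shape)
```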
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model with 1.3B parameters achieves performance competitive with vanilla transformers that use full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z)
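A rough illustration of mixed attention spans (the window size and which layers get full attention are assumptions, not the paper's configuration): most layers use a local sliding-window mask and only a few layers attend over the whole sequence.

```python
import torch

def mixed_span_masks(num_layers, seq_len, window, full_layers):
    """Per-layer boolean attention masks; True means the position may be attended.

    Layers in `full_layers` use full attention; every other layer is restricted
    to a +/- `window` sliding window (values below are illustrative).
    """
    i = torch.arange(seq_len)
    local = (i[:, None] - i[None, :]).abs() <= window
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return [full if layer in full_layers else local for layer in range(num_layers)]

masks = mixed_span_masks(num_layers=24, seq_len=4096, window=256, full_layers={11, 23})
print(sum(bool(m.all()) for m in masks), "of", len(masks), "layers use full attention")
```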
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
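The functional-bias idea can be sketched as follows (a simplification: FIRE's exact transform, thresholding, and normalization are in the paper; the log scaling and query-position normalization below only follow its spirit): a tiny MLP maps a normalized relative distance to an additive attention bias, so the same learned function applies at lengths never seen in training.

```python
import torch
import torch.nn as nn

class FunctionalRelativeBias(nn.Module):
    """Maps a normalized relative distance to an attention bias with a small MLP."""
    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, seq_len):
        i = torch.arange(1, seq_len + 1).float()
        rel = (i[:, None] - i[None, :]).clamp(min=0)            # causal relative distance i - j
        norm = torch.log1p(rel) / torch.log1p(i)[:, None]       # progressively normalized by position
        return self.mlp(norm.unsqueeze(-1)).squeeze(-1)         # (seq, seq) additive bias

bias = FunctionalRelativeBias()(seq_len=128)
print(bias.shape)   # torch.Size([128, 128]); added to attention logits before the softmax
```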
- LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
- Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra long sequences, by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than 3x efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive/better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z)
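A generic sketch of the prioritize-and-compress idea, not the Vcc algorithm itself (how important tokens are chosen and how the rest are compressed are assumptions here): a small set of designated important tokens is kept exactly, while the remaining tokens are mean-pooled in chunks, so each layer operates on a far shorter sequence.

```python
import torch

def compress_sequence(x, important_idx, chunk=16):
    """Keep designated important tokens; mean-pool the rest in fixed-size chunks.

    x: (seq, d_model) layer input. The selection of important tokens and the
    pooling scheme are illustrative assumptions.
    """
    mask = torch.ones(x.size(0), dtype=torch.bool)
    mask[important_idx] = False
    rest = x[mask]
    usable = (rest.size(0) // chunk) * chunk               # drop the ragged tail for simplicity
    pooled = rest[:usable].view(-1, chunk, x.size(1)).mean(dim=1)
    return torch.cat([x[important_idx], pooled], dim=0)

x = torch.randn(16384, 512)                                # 16K-token layer input
z = compress_sequence(x, important_idx=torch.arange(64))   # e.g. 64 query tokens treated as important
print(x.shape[0], "->", z.shape[0], "positions processed by the layer")
```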
- Investigating Efficiently Extending Transformers for Long Input Summarization [37.622021824791254]
We investigate what model architectural changes and pretraining paradigms can most efficiently adapt a pretrained Transformer for long input summarization.
We find that a staggered, block-local Transformer with global tokens strikes a good balance of performance and efficiency.
We introduce PEGASUS-X, an extension of the PEGASUS model with additional long-input pretraining to handle inputs up to 16K tokens.
arXiv Detail & Related papers (2022-08-08T18:10:58Z)
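The staggered block-local pattern with global tokens can be written as an attention mask (block size, number of global tokens, and the per-layer stagger offset are illustrative assumptions, not the configuration used in the paper):

```python
import torch

def block_local_global_mask(seq_len, block, num_global, stagger=0):
    """Boolean mask in which True marks an allowed attention pair.

    Tokens attend within their (optionally staggered) block; the first
    `num_global` positions are global tokens that attend to, and are attended
    by, every position.
    """
    blk = (torch.arange(seq_len) + stagger) // block       # stagger shifts block boundaries per layer
    mask = blk[:, None] == blk[None, :]                    # block-local attention
    mask[:num_global, :] = True                            # global tokens see everything
    mask[:, :num_global] = True                            # everything sees global tokens
    return mask

mask = block_local_global_mask(seq_len=4096, block=512, num_global=32, stagger=256)
print(mask.shape, mask.float().mean().item())              # fraction of allowed attention pairs
```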
- The NLP Task Effectiveness of Long-Range Transformers [38.46467445144777]
Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity.
We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets.
We find that the attention mechanisms of long-range transformers have advantages for content selection and query-guided decoding, but they come with previously unrecognized drawbacks.
arXiv Detail & Related papers (2022-02-16T04:39:35Z)
- LongT5: Efficient Text-To-Text Transformer for Long Sequences [8.743996838160825]
We present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time.
We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks.
arXiv Detail & Related papers (2021-12-15T06:35:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.