Long-Short Transformer: Efficient Transformers for Language and Vision
- URL: http://arxiv.org/abs/2107.02192v1
- Date: Mon, 5 Jul 2021 18:00:14 GMT
- Title: Long-Short Transformer: Efficient Transformers for Language and Vision
- Authors: Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein,
Anima Anandkumar, Bryan Catanzaro
- Abstract summary: Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
- Score: 97.2850205384295
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers have achieved success in both language and vision domains.
However, it is prohibitively expensive to scale them to long sequences such as
long documents or high-resolution images, because the self-attention mechanism has
quadratic time and memory complexity with respect to the input sequence
length. In this paper, we propose Long-Short Transformer (Transformer-LS), an
efficient self-attention mechanism for modeling long sequences with linear
complexity for both language and vision tasks. It aggregates a novel long-range
attention with dynamic projection to model distant correlations and a
short-term attention to capture fine-grained local correlations. We propose a
dual normalization strategy to account for the scale mismatch between the two
attention mechanisms. Transformer-LS can be applied to both autoregressive and
bidirectional models without additional complexity. Our method outperforms the
state-of-the-art models on multiple tasks in language and vision domains,
including the Long Range Arena benchmark, autoregressive language modeling, and
ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on
enwik8 using half the number of parameters of the previous method, while being
faster and able to handle sequences 3$\times$ as long as its full-attention
version on the same hardware. On ImageNet, it can obtain state-of-the-art
results (e.g., 84.1% Top-1 accuracy when trained on 224$\times$224 ImageNet-1K
only), while being more scalable on high-resolution images. The
models and source code will be released soon.
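As a rough illustration of the mechanism the abstract describes, here is a minimal single-head, bidirectional NumPy sketch. It is not the released implementation: the window half-width `w`, the projection rank `r`, and the matrix `W_p` are illustrative assumptions, and the dual normalization is rendered simply as two separate layer norms on the local and projected key/value segments.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last dimension (no learned scale/bias, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def long_short_attention(Q, K, V, W_p, w=4):
    """Single-head, bidirectional long-short attention (toy version).

    Q, K, V : (n, d) query/key/value matrices.
    W_p     : (d, r) learned matrix defining the dynamic projection.
    w       : half-width of the short-term (sliding-window) attention.
    """
    n, d = Q.shape

    # Long-range branch: project the n keys/values down to r "global" tokens.
    # The projection weights depend on the key content, hence "dynamic".
    P = softmax(K @ W_p, axis=0)        # (n, r), one weight column per global token
    K_long = layer_norm(P.T @ K)        # (r, d) compressed keys
    V_long = layer_norm(P.T @ V)        # (r, d) compressed values

    # Short-term branch: local keys/values, normalized separately.
    # Two separate norms stand in for the "dual normalization" idea, meant to
    # put the two K/V segments on a comparable scale.
    K_short = layer_norm(K)
    V_short = layer_norm(V)

    out = np.zeros_like(Q)
    for i in range(n):
        lo, hi = max(0, i - w), min(n, i + w + 1)
        # Each query attends jointly (single softmax) to its local window
        # plus the r projected global tokens: O(w + r) keys per position.
        K_i = np.concatenate([K_short[lo:hi], K_long], axis=0)
        V_i = np.concatenate([V_short[lo:hi], V_long], axis=0)
        out[i] = softmax(Q[i] @ K_i.T / np.sqrt(d)) @ V_i
    return out

# Toy usage: length-16 sequence, width 8, projection rank 2.
rng = np.random.default_rng(0)
n, d, r = 16, 8, 2
Q, K, V = rng.standard_normal((3, n, d))
W_p = rng.standard_normal((d, r))
print(long_short_attention(Q, K, V, W_p).shape)   # (16, 8)
```

Because each position attends to only O(w + r) keys instead of all n, the total cost grows linearly with sequence length, which is where the linear complexity claimed in the abstract comes from.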
Related papers
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning.
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z)
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model of 1.3B parameters can achieve competitive performance to vanilla transformers with full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z)
- Long-Range Transformer Architectures for Document Understanding [1.9331361036118608]
Document Understanding (DU) was not left behind, with the first Transformer-based models for DU dating from late 2019.
We introduce 2 new multi-modal (text + layout) long-range models for DU based on efficient implementations of Transformers for long sequences.
Relative 2D attention proved effective on dense text for both normal and long-range models.
arXiv Detail & Related papers (2023-09-11T14:45:24Z)
- LongNet: Scaling Transformers to 1,000,000,000 Tokens [146.4077038371075]
LongNet is a Transformer variant that can scale sequence length to more than 1 billion tokens.
Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
arXiv Detail & Related papers (2023-07-05T17:59:38Z)
- A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies [0.0]
We compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation.
We then demonstrate that long context length does yield better performance, albeit application-dependent.
Inspired by emerging sparse models of huge capacity, we propose a machine learning system for handling million-scale dependencies.
arXiv Detail & Related papers (2023-02-13T09:47:31Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity (see the sketch after this list).
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding [81.07894629034767]
This paper presents a new Vision Transformer (ViT) architecture Multi-Scale Vision Longformer.
It significantly enhances the ViT of Dosovitskiy et al. (2020) for encoding high-resolution images using two techniques.
arXiv Detail & Related papers (2021-03-29T06:23:20Z)
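For the Global Filter Networks entry above, the frequency-domain mixing it refers to can be sketched in a few lines of NumPy. This is a hedged toy illustration: the feature-map size and the learnable complex filter `filt` are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def global_filter_layer(x, filt):
    """x: (H, W, C) real-valued feature map; filt: (H, W//2 + 1, C) complex filter."""
    # 2D real FFT over the spatial dimensions (log-linear cost in H*W).
    X = np.fft.rfft2(x, axes=(0, 1))
    # Elementwise multiplication in the frequency domain mixes information
    # globally across all spatial positions.
    X = X * filt
    # Back to the spatial domain.
    return np.fft.irfft2(X, s=x.shape[:2], axes=(0, 1))

# Toy usage: a 14x14 grid of patch tokens with 64 channels.
rng = np.random.default_rng(0)
H, W, C = 14, 14, 64
x = rng.standard_normal((H, W, C))
filt = rng.standard_normal((H, W // 2 + 1, C)) + 1j * rng.standard_normal((H, W // 2 + 1, C))
print(global_filter_layer(x, filt).shape)  # (14, 14, 64)
```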
This list is automatically generated from the titles and abstracts of the papers in this site.