cosFormer: Rethinking Softmax in Attention
- URL: http://arxiv.org/abs/2202.08791v1
- Date: Thu, 17 Feb 2022 17:53:48 GMT
- Title: cosFormer: Rethinking Softmax in Attention
- Authors: Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv,
Junjie Yan, Lingpeng Kong, Yiran Zhong
- Abstract summary: Kernel methods are often adopted to reduce the complexity of attention by approximating the softmax operator.
Due to the approximation errors, their performance varies across tasks and corpora and suffers significant drops.
We propose a linear transformer called cosFormer that can achieve accuracy comparable to or better than the vanilla transformer.
- Score: 60.557869510885205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has shown great success in natural language processing,
computer vision, and audio processing. As one of its core components, softmax
attention helps to capture long-range dependencies, yet it prohibits scaling up
due to its quadratic space and time complexity in the sequence length.
Kernel methods are often adopted to reduce this complexity by approximating the
softmax operator. Nevertheless, due to approximation errors, their performance
varies across tasks and corpora and suffers significant drops compared with
vanilla softmax attention. In this paper, we propose a linear transformer called
cosFormer that can achieve accuracy comparable to or better than the vanilla
transformer in both causal and cross attention. cosFormer is based on two key
properties of softmax attention: i) the non-negativity of the attention matrix;
ii) a non-linear re-weighting scheme that concentrates the distribution of the
attention matrix. As its linear substitute, cosFormer fulfills these properties
with a linear operator and a cosine-based distance re-weighting mechanism.
Extensive experiments on language modeling and text understanding tasks
demonstrate the effectiveness of our method. We further examine our method on
long sequences and achieve state-of-the-art performance on the Long-Range Arena
benchmark. The source code is available at
https://github.com/OpenNLPLab/cosFormer.
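Below is a minimal NumPy sketch of the non-causal (cross-attention) case as described in the abstract: ReLU serves as the linear operator that enforces non-negativity, and a cos-based positional re-weighting is decomposed with the cosine angle-difference identity so that the key-value product can be computed first, keeping the cost linear in sequence length. The exact scaling of the position indices, the normalization, and all function and variable names here are illustrative assumptions; the official implementation at the repository linked above is the reference.

```python
# A minimal sketch of cosFormer-style linear attention (non-causal case),
# based on the abstract: ReLU for non-negativity plus a cosine distance
# re-weighting. Index scaling and normalization are assumptions.
import numpy as np

def cosformer_attention(Q, K, V, eps=1e-6):
    """Q: (n, d), K: (m, d), V: (m, d_v). Cost is O(n * d * d_v)
    instead of the O(n * m) of softmax attention."""
    n, m = Q.shape[0], K.shape[0]
    M = max(n, m)  # assumed normalizing length for the position indices

    # 1) Linear operator: ReLU keeps the similarity scores non-negative.
    Qr, Kr = np.maximum(Q, 0.0), np.maximum(K, 0.0)

    # 2) The re-weighting cos(pi/2 * (i - j) / M) decomposes via
    #    cos(a - b) = cos(a)cos(b) + sin(a)sin(b), so the re-weighted
    #    attention still factorizes into a linear form.
    i = np.arange(1, n + 1)[:, None]
    j = np.arange(1, m + 1)[:, None]
    Q_cos, Q_sin = Qr * np.cos(np.pi * i / (2 * M)), Qr * np.sin(np.pi * i / (2 * M))
    K_cos, K_sin = Kr * np.cos(np.pi * j / (2 * M)), Kr * np.sin(np.pi * j / (2 * M))

    # 3) Compute (K^T V) first so the sequence-length dimension never
    #    appears squared: (d, d_v) intermediates instead of an (n, m) matrix.
    num = Q_cos @ (K_cos.T @ V) + Q_sin @ (K_sin.T @ V)
    den = Q_cos @ K_cos.sum(axis=0) + Q_sin @ K_sin.sum(axis=0)
    return num / (den[:, None] + eps)

# Example: 512 queries attending over 1024 keys with 64-dim heads.
rng = np.random.default_rng(0)
out = cosformer_attention(rng.normal(size=(512, 64)),
                          rng.normal(size=(1024, 64)),
                          rng.normal(size=(1024, 64)))
print(out.shape)  # (512, 64)
```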
Related papers
- Cottention: Linear Transformers With Cosine Attention [2.762180345826837]
We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity.
Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention.
arXiv Detail & Related papers (2024-09-27T13:38:36Z)
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model with 1.3B parameters can achieve performance competitive with vanilla transformers using full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts at approximating self-attention with linear complexity have been made in natural language processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer (SOFT) is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Luna: Linear Unified Nested Attention [71.66026714473482]
We propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions.
Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function.
Compared with a more traditional attention mechanism, Luna introduces an additional fixed-length sequence as input and an additional corresponding output, which allows Luna to perform the attention operation in linear time (a rough sketch follows this entry).
arXiv Detail & Related papers (2021-06-03T01:47:26Z)
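The following is a rough NumPy sketch, under stated assumptions, of the nested attention idea summarized above: a fixed-length packing sequence attends over the input, and the input then attends over the packed result, so each step is linear in the input length. The single-head setup, the plain softmax, and all names (nested_linear_attention, the packing length of 16) are illustrative assumptions rather than Luna's exact formulation.

```python
# A rough sketch of nested linear attention: pack the context with a
# fixed-length sequence, then let the queries attend over the packed result.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # Standard scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def nested_linear_attention(X, P, context):
    """X: (n, d) queries, context: (m, d) keys/values,
    P: (l, d) learned fixed-length packing sequence with l << n, m.
    Cost is O((n + m) * l * d) rather than O(n * m * d)."""
    packed = attend(P, context, context)   # (l, d): pack the context
    return attend(X, packed, packed)       # (n, d): unpack against the queries

rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 64))
P = rng.normal(size=(16, 64))              # the extra fixed-length input
print(nested_linear_attention(X, P, X).shape)  # (2048, 64)
```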