cosFormer: Rethinking Softmax in Attention
- URL: http://arxiv.org/abs/2202.08791v1
- Date: Thu, 17 Feb 2022 17:53:48 GMT
- Title: cosFormer: Rethinking Softmax in Attention
- Authors: Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv,
Junjie Yan, Lingpeng Kong, Yiran Zhong
- Abstract summary: Kernel methods are often adopted to reduce the complexity of attention by approximating the softmax operator.
Due to the approximation errors, their performance varies across tasks and corpora and suffers significant drops.
We propose a linear transformer called cosFormer that can achieve accuracy comparable to or better than the vanilla transformer.
- Score: 60.557869510885205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer has shown great success in natural language processing,
computer vision, and audio processing. As one of its core components, softmax
attention helps to capture long-range dependencies, yet it prohibits scaling up
due to its quadratic space and time complexity in the sequence length.
Kernel methods are often adopted to reduce this complexity by approximating the
softmax operator. Nevertheless, due to approximation errors, their performance
varies across tasks and corpora and suffers significant drops compared with
vanilla softmax attention. In this paper, we propose a linear transformer called
cosFormer that can achieve accuracy comparable to or better than the vanilla
transformer in both causal and cross attention. cosFormer is based on two key
properties of softmax attention: i) the non-negativity of the attention matrix;
ii) a non-linear re-weighting scheme that concentrates the distribution of the
attention matrix. As its linear substitute, cosFormer fulfills these properties
with a linear operator and a cosine-based distance re-weighting mechanism.
Extensive experiments on language modeling and text understanding tasks
demonstrate the effectiveness of our method. We further examine our method on
long sequences and achieve state-of-the-art performance on the Long-Range Arena
benchmark. The source code is available at
https://github.com/OpenNLPLab/cosFormer.
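Below is a minimal NumPy sketch of the non-causal (cross-attention) case as described in the abstract: ReLU serves as the linear operator that enforces non-negativity, and a cos-based positional re-weighting is decomposed with the cosine angle-difference identity so that the key-value product can be computed first, keeping the cost linear in sequence length. The exact scaling of the position indices, the normalization, and all function and variable names here are illustrative assumptions; the official implementation at the repository linked above is the reference.

```python
# A minimal sketch of cosFormer-style linear attention (non-causal case),
# based on the abstract: ReLU for non-negativity plus a cosine distance
# re-weighting. Index scaling and normalization are assumptions.
import numpy as np

def cosformer_attention(Q, K, V, eps=1e-6):
    """Q: (n, d), K: (m, d), V: (m, d_v). Cost is O(n * d * d_v)
    instead of the O(n * m) of softmax attention."""
    n, m = Q.shape[0], K.shape[0]
    M = max(n, m)  # assumed normalizing length for the position indices

    # 1) Linear operator: ReLU keeps the similarity scores non-negative.
    Qr, Kr = np.maximum(Q, 0.0), np.maximum(K, 0.0)

    # 2) The re-weighting cos(pi/2 * (i - j) / M) decomposes via
    #    cos(a - b) = cos(a)cos(b) + sin(a)sin(b), so the re-weighted
    #    attention still factorizes into a linear form.
    i = np.arange(1, n + 1)[:, None]
    j = np.arange(1, m + 1)[:, None]
    Q_cos, Q_sin = Qr * np.cos(np.pi * i / (2 * M)), Qr * np.sin(np.pi * i / (2 * M))
    K_cos, K_sin = Kr * np.cos(np.pi * j / (2 * M)), Kr * np.sin(np.pi * j / (2 * M))

    # 3) Compute (K^T V) first so the sequence-length dimension never
    #    appears squared: (d, d_v) intermediates instead of an (n, m) matrix.
    num = Q_cos @ (K_cos.T @ V) + Q_sin @ (K_sin.T @ V)
    den = Q_cos @ K_cos.sum(axis=0) + Q_sin @ K_sin.sum(axis=0)
    return num / (den[:, None] + eps)

# Example: 512 queries attending over 1024 keys with 64-dim heads.
rng = np.random.default_rng(0)
out = cosformer_attention(rng.normal(size=(512, 64)),
                          rng.normal(size=(1024, 64)),
                          rng.normal(size=(1024, 64)))
print(out.shape)  # (512, 64)
```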
Related papers
- Cottention: Linear Transformers With Cosine Attention [2.762180345826837]
We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity.
Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention.
arXiv Detail & Related papers (2024-09-27T13:38:36Z)
- Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer [36.75562615596186]
We propose MASFormer, an easy-to-implement transformer variant with Mixed Attention Spans.
MASFormer is equipped with full attention to capture long-range dependencies, but only at a small number of layers.
Experiments show that a decoder-only MASFormer model with 1.3B parameters can achieve performance competitive with vanilla transformers using full attention.
arXiv Detail & Related papers (2023-10-19T03:32:05Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts at approximating self-attention with linear complexity have been made in natural language processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer (SOFT) is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
- Luna: Linear Unified Nested Attention [71.66026714473482]
We propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions.
Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function.
Compared with a more traditional attention mechanism, Luna introduces an additional fixed-length sequence as input and an additional corresponding output, which allows Luna to perform the attention operation in linear time (a rough sketch follows this entry).
arXiv Detail & Related papers (2021-06-03T01:47:26Z)
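The following is a rough NumPy sketch, under stated assumptions, of the nested attention idea summarized above: a fixed-length packing sequence attends over the input, and the input then attends over the packed result, so each step is linear in the input length. The single-head setup, the plain softmax, and all names (nested_linear_attention, the packing length of 16) are illustrative assumptions rather than Luna's exact formulation.

```python
# A rough sketch of nested linear attention: pack the context with a
# fixed-length sequence, then let the queries attend over the packed result.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # Standard scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def nested_linear_attention(X, P, context):
    """X: (n, d) queries, context: (m, d) keys/values,
    P: (l, d) learned fixed-length packing sequence with l << n, m.
    Cost is O((n + m) * l * d) rather than O(n * m * d)."""
    packed = attend(P, context, context)   # (l, d): pack the context
    return attend(X, packed, packed)       # (n, d): unpack against the queries

rng = np.random.default_rng(0)
X = rng.normal(size=(2048, 64))
P = rng.normal(size=(16, 64))              # the extra fixed-length input
print(nested_linear_attention(X, P, X).shape)  # (2048, 64)
```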