Theory, Analysis, and Best Practices for Sigmoid Self-Attention
- URL: http://arxiv.org/abs/2409.04431v1
- Date: Fri, 6 Sep 2024 17:53:26 GMT
- Title: Theory, Analysis, and Best Practices for Sigmoid Self-Attention
- Authors: Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb
- Abstract summary: We revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis.
We prove that transformers with sigmoid attention are universal function approximators.
We introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention.
- Score: 16.73166377436999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers.
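The core change the abstract describes, an elementwise sigmoid in place of the row-wise softmax over query-key scores, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation (FLASHSIGMOID is a fused, hardware-aware kernel); the length-dependent bias used below as the "proper normalization" is an assumption, as are all names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention with a row-wise softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                          # (n, n) logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # rows sum to 1
    return weights @ V

def sigmoid_attention(Q, K, V, bias=None):
    """Sigmoid attention sketch: an elementwise sigmoid replaces the softmax.
    `bias` stands in for the paper's normalization; the length-dependent
    default used here is an assumption, not the authors' exact recipe."""
    n, d = Q.shape
    if bias is None:
        bias = -np.log(n)                                  # assumed bias term
    scores = Q @ K.T / np.sqrt(d) + bias
    weights = sigmoid(scores)                              # rows need not sum to 1
    return weights @ V

# toy usage
rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape, sigmoid_attention(Q, K, V).shape)
```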
Related papers
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration [21.808835887740543]
We propose SageAttention, a highly efficient and accurate quantization method for attention.
Our approach incurs almost no end-to-end metrics loss across diverse models.
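As a rough illustration of quantizing attention inputs to 8 bits, the sketch below applies a generic symmetric per-tensor INT8 quantization to Q and K before the score matmul. It is not SageAttention's actual scheme (the paper's accuracy-preserving details are not reproduced); the function names and scale choices are assumptions.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: int8 values plus a float scale."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_attention_scores(Q, K):
    """Attention logits from INT8 Q/K, dequantized by the product of scales.
    Illustrative only -- not the paper's kernel."""
    q_q, s_q = quantize_int8(Q)
    k_q, s_k = quantize_int8(K)
    logits = (q_q.astype(np.int32) @ k_q.astype(np.int32).T) * (s_q * s_k)
    return logits / np.sqrt(Q.shape[-1])
```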
arXiv Detail & Related papers (2024-10-03T10:25:23Z) - Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence which is expensive and prone to errors.
We propose KeyMotion, that generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the keyframes into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
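The cross-modal attention step mentioned above can be illustrated generically: queries come from the latents, keys and values from the text condition. This is a plain single-head cross-attention sketch, not the paper's Parallel Skip Transformer; the shapes and weight names are assumptions.

```python
import numpy as np

def cross_attention(latents, text_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: (motion) latents attend to text tokens.
    Generic sketch only, not the KeyMotion architecture."""
    Q = latents @ Wq                     # (n_latent, d)
    K = text_tokens @ Wk                 # (n_text, d)
    V = text_tokens @ Wv                 # (n_text, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                   # text-conditioned update of the latents
```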
arXiv Detail & Related papers (2024-05-24T11:12:37Z) - FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
arXiv Detail & Related papers (2024-02-12T18:59:39Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Stabilizing Transformer Training by Preventing Attention Entropy
Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers.
We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training.
We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
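Roughly, $\sigma$Reparam divides each weight matrix by an estimate of its spectral norm and multiplies by a learnable scale. The NumPy sketch below uses power iteration for the spectral norm; the learnable scale is modeled as a plain argument, and the iteration count is an assumption.

```python
import numpy as np

def spectral_norm(W, n_iter=10, rng=None):
    """Estimate the largest singular value of W by power iteration."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    return float(u @ W @ v)

def sigma_reparam(W, gamma=1.0):
    """sigma-Reparam-style reparameterization: W scaled by gamma / spectral_norm(W).
    In the paper gamma is a learnable scalar; here it is a fixed argument."""
    return (gamma / spectral_norm(W)) * W
```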
arXiv Detail & Related papers (2023-03-11T03:30:47Z) - cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora and can suffer significant drops.
We propose a linear transformer called cosFormer that can achieve comparable or better accuracy to the vanilla transformer.
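The summary does not spell out cosFormer's similarity; roughly, it combines a ReLU feature map with a cosine re-weighting over relative positions. The quadratic-time sketch below conveys that idea only; the exact constants and the linear-time decomposition the authors use are not reproduced and should be treated as assumptions.

```python
import numpy as np

def cosformer_like_attention(Q, K, V):
    """Quadratic-time sketch of a cosFormer-style similarity:
    ReLU feature maps re-weighted by a cosine of relative position.
    The paper evaluates this in linear time via the cosine addition formula;
    that decomposition is omitted, and the constants are assumptions."""
    n = Q.shape[0]
    Qr, Kr = np.maximum(Q, 0.0), np.maximum(K, 0.0)        # ReLU feature maps
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    reweight = np.cos(np.pi / 2.0 * (i - j) / n)           # locality-biased weighting
    sim = (Qr @ Kr.T) * reweight                           # non-negative similarities
    sim = sim / (sim.sum(axis=-1, keepdims=True) + 1e-12)  # row-normalize
    return sim @ V
```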
arXiv Detail & Related papers (2022-02-17T17:53:48Z) - Sinkformers: Transformers with Doubly Stochastic Attention [22.32840998053339]
We use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer.
On the experimental side, we show Sinkformers enhance model accuracy in vision and natural language processing tasks.
Importantly, on 3D shapes classification, Sinkformers lead to a significant improvement.
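Making the attention matrix doubly stochastic with Sinkhorn's algorithm amounts to alternating row and column normalizations of a positive score matrix. A minimal sketch follows; the iteration count is an assumption, and a single row normalization recovers ordinary softmax attention.

```python
import numpy as np

def sinkhorn_attention(Q, K, V, n_iter=5):
    """Sinkformer-style attention sketch: Sinkhorn iterations push the attention
    matrix towards being doubly stochastic (rows and columns sum to 1)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max())           # positive kernel matrix
    for _ in range(n_iter):
        A /= A.sum(axis=-1, keepdims=True)      # normalize rows
        A /= A.sum(axis=0, keepdims=True)       # normalize columns
    return A @ V
```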
arXiv Detail & Related papers (2021-10-22T13:25:01Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and high head diversity, and that the induced cross-attention achieves better accuracy with respect to source-target word alignment.
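Replacing the softmax with a ReLU is straightforward to sketch: negative logits are zeroed out, which is where the sparsity comes from. The row-sum normalization below is an illustrative assumption; the paper stabilizes ReLU attention with its own normalization scheme.

```python
import numpy as np

def relu_attention(Q, K, V, eps=1e-6):
    """ReLA-style attention sketch: ReLU on the logits instead of softmax.
    Negative scores become exactly zero, giving naturally sparse weights.
    The simple row-sum normalization here is an assumption, not the paper's."""
    d = Q.shape[-1]
    weights = np.maximum(Q @ K.T / np.sqrt(d), 0.0)   # sparse, non-negative
    weights /= weights.sum(axis=-1, keepdims=True) + eps
    return weights @ V
```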
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Rethinking Attention with Performers [45.47365397101224]
We introduce Performers, Transformer architectures which can estimate full-rank-attention Transformers with provable accuracy.
Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods.
We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
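A minimal sketch of attention with positive random features in the spirit of FAVOR+: the softmax kernel exp(q·k) is approximated by an inner product of positive feature maps, which lets attention be computed in time linear in sequence length. Plain Gaussian features are used here; the orthogonalization step the paper describes is omitted, and the feature count is an assumption.

```python
import numpy as np

def positive_random_features(X, W):
    """Positive random features for the softmax kernel:
    phi(x) = exp(W x - |x|^2 / 2) / sqrt(m), with Gaussian W.
    The paper additionally orthogonalizes the rows of W (omitted here)."""
    m = W.shape[0]
    proj = X @ W.T                                           # (n, m)
    return np.exp(proj - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

def performer_like_attention(Q, K, V, m=64, rng=None):
    """Linear-time attention sketch: phi(Q) @ (phi(K)^T V), normalized per query."""
    rng = rng or np.random.default_rng(0)
    d = Q.shape[-1]
    W = rng.standard_normal((m, d))
    q_feat = positive_random_features(Q / d ** 0.25, W)      # 1/sqrt(d) scaling
    k_feat = positive_random_features(K / d ** 0.25, W)
    num = q_feat @ (k_feat.T @ V)                            # linear in sequence length
    den = q_feat @ k_feat.sum(axis=0)                        # per-query normalizer
    return num / den[:, None]
```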
arXiv Detail & Related papers (2020-09-30T17:09:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.