Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
- URL: http://arxiv.org/abs/2502.00281v2
- Date: Sat, 24 May 2025 02:54:56 GMT
- Title: Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective
- Authors: Fanqi Yan, Huy Nguyen, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo,
- Abstract summary: This paper theoretically demonstrates that sigmoid self-attention is more sample-efficient than its softmax counterpart.<n>We represent the self-attention matrix as a mixture of experts and show that experts'' in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
- Score: 69.72942835553228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure slows down the attention computation due to its row-wise nature, and it inherently introduces competition among tokens: as the weight assigned to one token increases, the weights of others decrease. This competitive dynamic may narrow the focus of self-attention to a limited set of features, potentially overlooking other informative characteristics. Recent experimental studies have shown that using the element-wise sigmoid function helps eliminate token competition and reduce the computational overhead. Despite these promising empirical results, a rigorous comparison between sigmoid and softmax self-attention mechanisms remains absent in the literature. This paper closes this gap by theoretically demonstrating that sigmoid self-attention is more sample-efficient than its softmax counterpart. Toward that goal, we represent the self-attention matrix as a mixture of experts and show that ``experts'' in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
Related papers
- Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization.<n>We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants.<n>We find that a simple modification-applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)-consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models.
Central to the success of MoE is an adaptive softmax gating mechanism which takes responsibility for determining the relevance of each expert to a given input and then dynamically assigning experts their respective weights.
We perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z) - Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models [7.80071686970278]
Traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases.<n>This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm.<n>We create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths.
arXiv Detail & Related papers (2025-01-23T07:21:08Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention.
We prove that linear attention is not injective, which is prone to assign identical attention weights to different query vectors.
Secondly, we confirm that effective local modeling is essential for the success of Softmax attention, in which linear attention falls short.
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - Rethinking Softmax: Self-Attention with Polynomial Activations [25.162734407461905]
We show that softmax attention in transformers can implicitly regularize the Frobenius norm of the attention matrix during training.
We then explore alternative activations that regularize the Frobenius norm of the attention matrix, making them suitable for attention-based architectures.
arXiv Detail & Related papers (2024-10-24T10:08:25Z) - Theory, Analysis, and Best Practices for Sigmoid Self-Attention [16.73166377436999]
We revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis.
We prove that transformers with sigmoid attention are universal function approximators.
We introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention.
arXiv Detail & Related papers (2024-09-06T17:53:26Z) - Sigmoid Gating is More Sample Efficient than Softmax Gating in Mixture of Experts [78.3687645289918]
We show that the sigmoid gating function enjoys a higher sample efficiency than the softmax gating for the statistical task of expert estimation.
We find that experts formulated as feed-forward networks with commonly used activation such as ReLU and GELU enjoy faster convergence rates under the sigmoid gating.
arXiv Detail & Related papers (2024-05-22T21:12:34Z) - Superiority of Softmax: Unveiling the Performance Edge Over Linear
Attention [28.98187418889448]
Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks.
The attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of softmax function.
linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity.
arXiv Detail & Related papers (2023-10-18T03:17:57Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Capturing Multi-Resolution Context by Dilated Self-Attention [58.69803243323346]
We propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention.
The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.
ASR results demonstrate substantial improvements compared to restricted self-attention alone, achieving similar results compared to full-sequence based self-attention with a fraction of the computational costs.
arXiv Detail & Related papers (2021-04-07T02:04:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.