Rethinking Softmax: Self-Attention with Polynomial Activations
- URL: http://arxiv.org/abs/2410.18613v1
- Date: Thu, 24 Oct 2024 10:08:25 GMT
- Title: Rethinking Softmax: Self-Attention with Polynomial Activations
- Authors: Hemanth Saratchandran, Jianqiao Zheng, Yiping Ji, Wenbo Zhang, Simon Lucey
- Abstract summary: We show that softmax attention in transformers can implicitly regularize the Frobenius norm of the attention matrix during training.
We then explore alternative activations that regularize the Frobenius norm of the attention matrix, making them suitable for attention-based architectures.
- Score: 25.162734407461905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper challenges the conventional belief that softmax attention in transformers is effective primarily because it generates a probability distribution for attention allocation. Instead, we theoretically show that its success lies in its ability to implicitly regularize the Frobenius norm of the attention matrix during training. We then explore alternative activations that regularize the Frobenius norm of the attention matrix, demonstrating that certain polynomial activations can achieve this effect, making them suitable for attention-based architectures. Empirical results indicate these activations perform comparably or better than softmax across various computer vision and language tasks, suggesting new possibilities for attention mechanisms beyond softmax.
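The proposed change is a drop-in substitution inside standard attention. Below is a minimal PyTorch sketch for orientation; the cubic activation and the division by sequence length are illustrative assumptions, since the paper derives which polynomials and scalings actually produce the softmax-like Frobenius-norm regularization.

```python
# Minimal sketch: self-attention with an elementwise polynomial activation in
# place of softmax. The cubic x**3 and the 1/seq_len scaling are illustrative
# assumptions, not the paper's exact prescription.
import torch

def polynomial_attention(q, k, v):
    # q, k, v: (batch, seq_len, d)
    d = q.shape[-1]
    n = q.shape[-2]
    scores = q @ k.transpose(-2, -1) / d**0.5   # (batch, n, n)
    weights = scores.pow(3) / n                 # polynomial activation, no row-normalization
    return weights @ v

def softmax_attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(2, 16, 32)
k = torch.randn(2, 16, 32)
v = torch.randn(2, 16, 32)
print(polynomial_attention(q, k, v).shape)  # torch.Size([2, 16, 32])
```

Note that, unlike softmax, the polynomial weights are not row-normalized into a probability distribution, which is consistent with the paper's claim that the probability-distribution view is not what makes attention work.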
Related papers
- Transformers Learn Faster with Semantic Focus [57.97235825738412]
We study sparse transformers in terms of learnability and generalization. We find that input-dependent sparse attention models appear to converge faster and generalize better than standard attention models.
arXiv Detail & Related papers (2025-06-17T01:19:28Z) - Self-Adjust Softmax [62.267367768385434]
The softmax function is crucial in Transformer attention, normalizing each row of the attention scores to sum to one. We propose Self-Adjust Softmax (SA-Softmax), which modifies $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$.
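Both variants are spelled out in the abstract; a minimal sketch applies them row-wise to attention scores (the `eps` guard against a zero denominator is an added safeguard, not from the paper):

```python
# Sketch of the two SA-Softmax variants as stated in the abstract:
# x * softmax(x) and its min/max-normalized form, applied along the last dim.
import torch

def sa_softmax(x, dim=-1):
    return x * torch.softmax(x, dim=dim)

def sa_softmax_normalized(x, dim=-1, eps=1e-6):
    x_min = x.amin(dim=dim, keepdim=True)
    x_max = x.amax(dim=dim, keepdim=True)
    lo = torch.minimum(x_min, torch.zeros_like(x_min))  # min(x_min, 0)
    hi = torch.maximum(x_max, torch.zeros_like(x_max))  # max(0, x_max)
    return (x - lo) / (hi - lo + eps) * torch.softmax(x, dim=dim)

scores = torch.randn(2, 8, 8)
print(sa_softmax(scores).shape, sa_softmax_normalized(scores).shape)
```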
arXiv Detail & Related papers (2025-02-25T15:07:40Z) - Sigmoid Self-Attention is Better than Softmax Self-Attention: A Mixture-of-Experts Perspective [69.72942835553228]
This paper theoretically demonstrates that sigmoid self-attention is more sample-efficient than its softmax counterpart.
We show that "experts" in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
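The paper's contribution is the sample-complexity analysis; mechanically, the change it studies is replacing the row-wise softmax with an elementwise sigmoid, as in this minimal sketch (no claim about the paper's exact parameterization):

```python
# Minimal sketch contrasting softmax and sigmoid activations on attention
# scores; the sigmoid variant drops the rows-sum-to-one constraint.
import torch

def attention(q, k, v, activation="softmax"):
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5
    if activation == "softmax":
        weights = torch.softmax(scores, dim=-1)
    else:                                   # elementwise sigmoid, no normalization
        weights = torch.sigmoid(scores)
    return weights @ v

q, k, v = (torch.randn(1, 8, 16) for _ in range(3))
print(attention(q, k, v, "sigmoid").shape)  # torch.Size([1, 8, 16])
```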
arXiv Detail & Related papers (2025-02-01T02:36:14Z) - Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention.
First, we prove that linear attention is not injective, making it prone to assigning identical attention weights to different query vectors.
Second, we confirm that effective local modeling is essential to the success of softmax attention, an area in which linear attention falls short.
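The non-injectivity claim is easy to see concretely: with the usual sum-normalization and an identity feature map (an assumption chosen here for illustration), a query and any positive rescaling of it yield identical attention weights, whereas softmax separates them.

```python
# Tiny demo: under sum-normalized linear attention, q and 2*q produce
# identical attention weights, while softmax attention distinguishes them.
import torch

torch.manual_seed(0)
k = torch.rand(5, 8)            # 5 keys, kept non-negative so raw scores stay positive
q1 = torch.rand(8)
q2 = 2.0 * q1                   # a different query vector

def linear_weights(q, k):
    s = k @ q                   # raw scores q . k_i
    return s / s.sum()          # scale-invariant normalization

def softmax_weights(q, k):
    return torch.softmax(k @ q, dim=0)

print(torch.allclose(linear_weights(q1, k), linear_weights(q2, k)))    # True
print(torch.allclose(softmax_weights(q1, k), softmax_weights(q2, k)))  # False
```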
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z) - Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix [17.086679273053853]
Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives.
Their growing capabilities come at the cost of extremely large model sizes, making deployment on edge devices challenging.
This paper introduces a novel approach to LLM weight pruning that directly optimizes for approximating the attention matrix.
arXiv Detail & Related papers (2024-10-15T04:35:56Z) - softmax is not enough (for sharp out-of-distribution) [16.167142726585357]
The softmax function is a key carrier of sharp behaviour in modern AI systems.
For tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time.
We propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
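A minimal sketch of temperature sharpening is below; the fixed temperature is purely illustrative, since the paper's point is to choose it adaptively per input at inference time.

```python
# Dividing logits by a temperature theta < 1 sharpens the softmax output.
# The paper selects theta adaptively; the fixed value here is illustrative only.
import torch

def sharpened_softmax(logits, theta=0.5):
    return torch.softmax(logits / theta, dim=-1)

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
print(torch.softmax(logits, dim=-1))     # baseline distribution
print(sharpened_softmax(logits, 0.25))   # more mass on the max entry
```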
arXiv Detail & Related papers (2024-10-01T22:22:35Z) - Exploring the Frontiers of Softmax: Provable Optimization, Applications in Diffusion Model, and Beyond [32.734716767055836]
This paper provides a theoretical study of the optimization and generalization properties of two-layer softmax neural networks.
We show that softmax neural networks can learn the target function in the over-parametrization regime.
Our work paves the way for further advancements in natural language processing and beyond.
arXiv Detail & Related papers (2024-05-06T08:15:29Z) - Linear Log-Normal Attention with Unbiased Concentration [3.034257650900382]
We study the self-attention mechanism by analyzing the distribution of the attention matrix and its concentration ability.
We propose instruments to measure these quantities and introduce a novel self-attention mechanism, Linear Log-Normal Attention.
Our experimental results on popular natural language benchmarks reveal that our proposed Linear Log-Normal Attention outperforms other linearized attention alternatives.
arXiv Detail & Related papers (2023-11-22T17:30:41Z) - Superiority of Softmax: Unveiling the Performance Edge Over Linear
Attention [28.98187418889448]
Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks.
The attention mechanism plays a crucial role in capturing token interactions within sequences through the softmax function.
Linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity.
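For context, the linear-complexity construction being compared against is the standard kernelized form below (a generic sketch with an elu+1 feature map, not this paper's contribution):

```python
# Kernelized linear attention: with a feature map phi, attention is computed as
# phi(Q) @ (phi(K)^T V), never materializing the n x n score matrix.
import torch
import torch.nn.functional as F

def phi(x):
    return F.elu(x) + 1.0                                       # a common positive feature map

def linear_attention(q, k, v, eps=1e-6):
    qf, kf = phi(q), phi(k)                                     # (b, n, d)
    kv = kf.transpose(-2, -1) @ v                               # (b, d, d), built in O(n d^2)
    z = qf @ kf.sum(dim=-2, keepdim=True).transpose(-2, -1)     # (b, n, 1) normalizer
    return (qf @ kv) / (z + eps)

q, k, v = (torch.randn(1, 128, 32) for _ in range(3))
print(linear_attention(q, k, v).shape)   # torch.Size([1, 128, 32])
```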
arXiv Detail & Related papers (2023-10-18T03:17:57Z) - The Inhibitor: ReLU and Addition-Based Attention for Efficient
Transformers [0.0]
We replace the dot-product and Softmax-based attention with an alternative mechanism involving addition and ReLU activation only.
This side-steps the expansion to double precision often required by matrix multiplication and avoids costly Softmax evaluations.
It can enable more efficient execution and support larger quantized Transformer models on resource-constrained hardware or alternative arithmetic systems like homomorphic encryption.
arXiv Detail & Related papers (2023-10-03T13:34:21Z) - Convex Bounds on the Softmax Function with Applications to Robustness
Verification [69.09991317119679]
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well.
This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models.
arXiv Detail & Related papers (2023-03-03T05:07:02Z) - Unitary Approximate Message Passing for Matrix Factorization [90.84906091118084]
We consider matrix factorization (MF) with certain constraints, which finds wide applications in various areas.
We develop a Bayesian approach to MF with an efficient message passing implementation, called UAMPMF.
We show that UAMPMF significantly outperforms state-of-the-art algorithms in terms of recovery accuracy, robustness and computational complexity.
arXiv Detail & Related papers (2022-07-31T12:09:32Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline
Reinforcement Learning [114.36124979578896]
We design a dynamic mechanism using offline reinforcement learning algorithms.
Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set.
arXiv Detail & Related papers (2022-05-05T05:44:26Z) - Learning Self-Modulating Attention in Continuous Time Space with
Applications to Sequential Recommendation [102.24108167002252]
We propose a novel attention network, named self-modulating attention, that models the complex and non-linearly evolving dynamic user preferences.
We empirically demonstrate the effectiveness of our method on top-N sequential recommendation tasks, and the results on three large-scale real-world datasets show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-03-30T03:54:11Z) - Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in
Attention Mechanism [8.007523868483085]
Softmax is widely used in neural networks for multiclass classification, gate structures, and attention mechanisms.
In this work, we suggest replacing the exponential function with periodic functions, and we delve into some potential periodic alternatives to Softmax.
Our method is shown to alleviate the gradient problem and yield substantial improvements over Softmax and its variants.
arXiv Detail & Related papers (2021-08-16T15:26:31Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and that the induced cross-attention achieves better accuracy with respect to source-target word alignment.
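A minimal sketch of the core substitution, assuming scaled dot-product scores; the paper pairs the ReLU with additional normalization that is omitted here.

```python
# ReLU in place of softmax on attention scores yields exact zeros (sparsity).
# This is only the bare substitution, not the full ReLA recipe.
import torch

def rectified_linear_attention(q, k, v):
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5
    weights = torch.relu(scores)            # sparse, unnormalized attention weights
    return weights @ v

q, k, v = (torch.randn(1, 8, 16) for _ in range(3))
print(rectified_linear_attention(q, k, v).shape)  # torch.Size([1, 8, 16])
```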
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers.
This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches.
In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
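As background, here is a sketch of the standard positive random-feature estimator of the softmax kernel $\exp(q^\top k)$ that such schemes build on; the paper's hybrid construction itself is not reproduced here.

```python
# Positive random features: phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m),
# so that phi(q) . phi(k) approximates exp(q . k) in expectation.
import torch

def positive_features(x, w):
    # x: (n, d); w: (m, d) with rows drawn from N(0, I)
    proj = x @ w.t()                                           # (n, m)
    return torch.exp(proj - x.pow(2).sum(-1, keepdim=True) / 2) / w.shape[0] ** 0.5

torch.manual_seed(0)
d, m = 8, 4096
q = torch.randn(3, d) * 0.3                                    # small norms keep estimator variance low
k = torch.randn(5, d) * 0.3
w = torch.randn(m, d)                                          # shared random projections
exact = torch.exp(q @ k.t())                                   # (3, 5) softmax-kernel values exp(q . k)
approx = positive_features(q, w) @ positive_features(k, w).t()
print((exact - approx).abs().max())                            # small for large m
```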
arXiv Detail & Related papers (2021-02-08T17:00:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.