Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture
- URL: http://arxiv.org/abs/2311.09406v1
- Date: Wed, 15 Nov 2023 22:10:42 GMT
- Title: Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture
- Authors: James Bernhard
- Abstract summary: The transformer neural network architecture uses a form of attention in which the dot product of query and key is divided by the square root of the key dimension before applying softmax.
We propose some alternative scalings, including dividing the dot product instead by the sum of the key lengths before applying softmax.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformer neural network architecture uses a form of attention in which
the dot product of query and key is divided by the square root of the key
dimension before applying softmax. This scaling of the dot product is designed
to avoid the absolute value of the dot products becoming so large that applying
softmax leads to vanishing gradients. In this paper, we propose some
alternative scalings, including dividing the dot product instead by the sum of
the key lengths before applying softmax. We use simulated keys and queries to
show that in many situations this appears to be more effective at avoiding
regions where applying softmax leads to vanishing gradients.
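To make the contrast concrete, here is a minimal numerical sketch (not the authors' reference implementation) comparing the standard division by the square root of the key dimension with the abstract's alternative of dividing by the sum of the key lengths, interpreted here as the sum of the Euclidean norms of the key vectors; the toy data, variable names, and softmax helper are assumptions made for illustration.

```python
# Minimal sketch contrasting the standard 1/sqrt(d_k) attention scaling with
# the alternative described in the abstract: dividing the query-key dot
# products by the sum of the key lengths (Euclidean norms of the key vectors).
# The exact placement of the normalizer is an interpretation of the abstract.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention_weights(Q, K):
    """Standard transformer scaling: divide dot products by sqrt(key dimension)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1)

def sum_of_key_lengths_attention_weights(Q, K):
    """Alternative scaling (as interpreted here): divide dot products by the
    sum of the Euclidean norms of the key vectors."""
    key_length_sum = np.linalg.norm(K, axis=-1).sum()
    scores = Q @ K.T / key_length_sum
    return softmax(scores, axis=-1)

# Toy comparison: deliberately large keys push the standard scaling toward a
# near-one-hot (saturated) softmax, the regime where gradients vanish.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = 10.0 * rng.normal(size=(8, 64))  # large-magnitude keys

w_std = standard_attention_weights(Q, K)
w_alt = sum_of_key_lengths_attention_weights(Q, K)
print("max weight per query, standard scaling:   ", w_std.max(axis=-1))
print("max weight per query, sum-of-key-lengths: ", w_alt.max(axis=-1))
```

With deliberately large keys, the standard scaling typically yields near-one-hot attention weights, the saturated regime in which softmax gradients vanish, while the larger sum-of-key-lengths normalizer keeps the scores small and the weights away from saturation.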
Related papers
- Scalable-Softmax Is Superior for Attention [0.0]
Transformer-based language models rely on Softmax to compute attention scores.
Scalable-Softmax (SSMax) replaces Softmax in scenarios where the input vector size varies.
Models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts.
arXiv Detail & Related papers (2025-01-31T18:55:35Z) - softmax is not enough (for sharp out-of-distribution) [16.167142726585357]
The softmax function is a key carrier of sharp behaviour in modern AI systems.
For tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time.
We propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time (a generic temperature-scaling sketch is given after this list).
arXiv Detail & Related papers (2024-10-01T22:22:35Z) - Bridging Discrete and Backpropagation: Straight-Through and Beyond [62.46558842476455]
We propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables.
We propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs.
arXiv Detail & Related papers (2023-04-17T20:59:49Z) - Convex Bounds on the Softmax Function with Applications to Robustness Verification [69.09991317119679]
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well.
This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models.
arXiv Detail & Related papers (2023-03-03T05:07:02Z) - A Study on ReLU and Softmax in Transformer [51.0740713922741]
The Transformer architecture consists of self-attention and feed-forward networks (FFNs) which can be viewed as key-value memories.
We first rebuild the connections between FFN and key-value memory by conducting extensive studies on ReLU and Softmax.
In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large.
arXiv Detail & Related papers (2023-02-13T15:41:20Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation [2.3813678058429626]
The softmax function is widely used in artificial neural networks for multiclass classification problems.
In this paper, we provide an empirical study of a simple and concise softmax variant, namely sparse-softmax, to alleviate the problems that the traditional softmax encounters in high-dimensional classification.
arXiv Detail & Related papers (2021-12-23T09:53:38Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts at approximating self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs.
We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems.
Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z) - Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism [8.007523868483085]
Softmax is widely used in neural networks for multiclass classification, gate structures and attention mechanisms.
In this work, we suggest replacing the exponential function with periodic functions, and we delve into some potential periodic alternatives to Softmax.
Our method is shown to alleviate the gradient problem and to yield substantial improvements compared to Softmax and its variants.
arXiv Detail & Related papers (2021-08-16T15:26:31Z)
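As a loose companion to the "adaptive temperature" entry above, the sketch below illustrates plain temperature scaling of softmax at inference time: lowering the temperature sharpens the output distribution. The entropy-based gate, threshold, and temperature value are hypothetical illustrations, not the specific adaptive rule proposed in "softmax is not enough".

```python
# Generic illustration of temperature scaling for sharpening softmax at
# inference time. The gating rule and constants below are hypothetical and
# NOT taken from any of the papers listed above.
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sharpened_softmax(logits, entropy_threshold=1.0, sharp_temperature=0.5):
    """If the standard softmax is too diffuse (high entropy), recompute it
    with a lower temperature to sharpen it. Purely illustrative."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum()
    if entropy > entropy_threshold:
        return softmax(logits, temperature=sharp_temperature)
    return p

logits = np.array([2.0, 1.8, 1.7, 0.1])   # nearly tied logits -> diffuse softmax
print(softmax(logits))            # relatively flat distribution
print(sharpened_softmax(logits))  # lower temperature concentrates the mass
```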
This list is automatically generated from the titles and abstracts of the papers in this site.