Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture
- URL: http://arxiv.org/abs/2311.09406v1
- Date: Wed, 15 Nov 2023 22:10:42 GMT
- Title: Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture
- Authors: James Bernhard
- Abstract summary: The transformer neural network architecture uses a form of attention in which the dot product of query and key is divided by the square root of the key dimension before applying softmax.
We propose some alternative scalings, including dividing the dot product instead by the sum of the key lengths before applying softmax.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transformer neural network architecture uses a form of attention in which
the dot product of query and key is divided by the square root of the key
dimension before applying softmax. This scaling of the dot product is designed
to avoid the absolute value of the dot products becoming so large that applying
softmax leads to vanishing gradients. In this paper, we propose some
alternative scalings, including dividing the dot product instead by the sum of
the key lengths before applying softmax. We use simulated keys and queries to
show that in many situations this appears to be more effective at avoiding
regions where applying softmax leads to vanishing gradients.
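To make the contrast concrete, here is a minimal numerical sketch (not the authors' reference implementation) comparing the standard division by the square root of the key dimension with the abstract's alternative of dividing by the sum of the key lengths, interpreted here as the sum of the Euclidean norms of the key vectors; the toy data, variable names, and softmax helper are assumptions made for illustration.

```python
# Minimal sketch contrasting the standard 1/sqrt(d_k) attention scaling with
# the alternative described in the abstract: dividing the query-key dot
# products by the sum of the key lengths (Euclidean norms of the key vectors).
# The exact placement of the normalizer is an interpretation of the abstract.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention_weights(Q, K):
    """Standard transformer scaling: divide dot products by sqrt(key dimension)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1)

def sum_of_key_lengths_attention_weights(Q, K):
    """Alternative scaling (as interpreted here): divide dot products by the
    sum of the Euclidean norms of the key vectors."""
    key_length_sum = np.linalg.norm(K, axis=-1).sum()
    scores = Q @ K.T / key_length_sum
    return softmax(scores, axis=-1)

# Toy comparison: deliberately large keys push the standard scaling toward a
# near-one-hot (saturated) softmax, the regime where gradients vanish.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = 10.0 * rng.normal(size=(8, 64))  # large-magnitude keys

w_std = standard_attention_weights(Q, K)
w_alt = sum_of_key_lengths_attention_weights(Q, K)
print("max weight per query, standard scaling:   ", w_std.max(axis=-1))
print("max weight per query, sum-of-key-lengths: ", w_alt.max(axis=-1))
```

With deliberately large keys, the standard scaling typically yields near-one-hot attention weights, the saturated regime in which softmax gradients vanish, while the larger sum-of-key-lengths normalizer keeps the scores small and the weights away from saturation.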
Related papers
- Scalable-Softmax Is Superior for Attention [0.0]
Transformer-based language models rely on Softmax to compute attention scores.
Scalable-Softmax (SSMax) replaces Softmax in scenarios where the input vector size varies.
Models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts.
arXiv Detail & Related papers (2025-01-31T18:55:35Z) - softmax is not enough (for sharp out-of-distribution) [16.167142726585357]
The softmax function is a key carrier of sharp behaviour in modern AI systems.
For tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time.
We propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time (a generic temperature-scaling sketch is given after this list).
arXiv Detail & Related papers (2024-10-01T22:22:35Z) - Bridging Discrete and Backpropagation: Straight-Through and Beyond [62.46558842476455]
We propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables.
We propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs.
arXiv Detail & Related papers (2023-04-17T20:59:49Z) - Convex Bounds on the Softmax Function with Applications to Robustness Verification [69.09991317119679]
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well.
This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models.
arXiv Detail & Related papers (2023-03-03T05:07:02Z) - A Study on ReLU and Softmax in Transformer [51.0740713922741]
The Transformer architecture consists of self-attention and feed-forward networks (FFNs) which can be viewed as key-value memories.
We first rebuild the connections between FFN and key-value memory by conducting extensive studies on ReLU and Softmax.
In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large.
arXiv Detail & Related papers (2023-02-13T15:41:20Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation [2.3813678058429626]
The softmax function is widely used in artificial neural networks for multiclass classification problems.
In this paper, we provide an empirical study of a simple and concise softmax variant, namely sparse-softmax, to alleviate the problems that the traditional softmax encounters in high-dimensional classification.
arXiv Detail & Related papers (2021-12-23T09:53:38Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts at approximating self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs.
We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems.
Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z) - Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism [8.007523868483085]
Softmax is widely used in neural networks for multiclass classification, gate structures and attention mechanisms.
In this work, we suggest replacing the exponential function with periodic functions, and we delve into some potential periodic alternatives to Softmax.
Our method is shown to alleviate the gradient problem and to yield substantial improvements compared to Softmax and its variants.
arXiv Detail & Related papers (2021-08-16T15:26:31Z)
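As a loose companion to the "adaptive temperature" entry above, the sketch below illustrates plain temperature scaling of softmax at inference time: lowering the temperature sharpens the output distribution. The entropy-based gate, threshold, and temperature value are hypothetical illustrations, not the specific adaptive rule proposed in "softmax is not enough".

```python
# Generic illustration of temperature scaling for sharpening softmax at
# inference time. The gating rule and constants below are hypothetical and
# NOT taken from any of the papers listed above.
import numpy as np

def softmax(x, temperature=1.0):
    z = x / temperature
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sharpened_softmax(logits, entropy_threshold=1.0, sharp_temperature=0.5):
    """If the standard softmax is too diffuse (high entropy), recompute it
    with a lower temperature to sharpen it. Purely illustrative."""
    p = softmax(logits)
    entropy = -(p * np.log(p + 1e-12)).sum()
    if entropy > entropy_threshold:
        return softmax(logits, temperature=sharp_temperature)
    return p

logits = np.array([2.0, 1.8, 1.7, 0.1])   # nearly tied logits -> diffuse softmax
print(softmax(logits))            # relatively flat distribution
print(sharpened_softmax(logits))  # lower temperature concentrates the mass
```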
This list is automatically generated from the titles and abstracts of the papers in this site.