Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in
Attention Mechanism
- URL: http://arxiv.org/abs/2108.07153v1
- Date: Mon, 16 Aug 2021 15:26:31 GMT
- Title: Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in
Attention Mechanism
- Authors: Shulun Wang, Bin Liu and Feng Liu
- Abstract summary: Softmax is widely used in neural networks for multiclass classification, gate structures, and attention mechanisms.
In this work, we suggest replacing the exponential function with periodic functions, and we examine several potential periodic alternatives to Softmax.
Our method is shown to alleviate the gradient problem and to yield substantial improvements over Softmax and its variants.
- Score: 8.007523868483085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Softmax is widely used in neural networks for multiclass classification, gate
structures, and attention mechanisms. The statistical assumption that the input is
normally distributed supports the gradient stability of Softmax. However, when Softmax
is used in attention mechanisms such as transformers, the correlation scores between
embeddings are often not normally distributed, so the gradient vanishing problem
appears; we confirm this point experimentally. In this work, we suggest replacing the
exponential function with periodic functions, and we examine several potential periodic
alternatives to Softmax from the perspective of both value and gradient. Through
experiments on a simply designed demo based on LeViT, our method is shown to alleviate
the gradient problem and to yield substantial improvements over Softmax and its
variants. Further, we analyze the impact of pre-normalization on Softmax and on our
methods through mathematics and experiments. Lastly, we increase the depth of the demo
and demonstrate the applicability of our method in deep structures.
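As a rough illustration of the idea only (not the paper's exact formulation), the Python sketch below contrasts standard softmax attention weights with a periodic stand-in that normalizes sin^2 of the scaled scores; the function name `periodic_attention_weights` and the specific choice of sin^2 are assumptions for illustration, and the periodic alternatives actually studied in the paper, together with their gradient analysis, may differ.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Standard softmax: exp saturates for large-magnitude scores, which is
    where the gradient-vanishing issue discussed in the abstract arises."""
    z = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def periodic_attention_weights(scores, axis=-1, eps=1e-6):
    """Illustrative periodic alternative (an assumption, not the paper's exact
    definition): replace exp(x) with the bounded periodic function sin(x)**2
    and renormalize, so the derivative does not decay for large scores."""
    w = np.sin(scores) ** 2
    return w / (w.sum(axis=axis, keepdims=True) + eps)

def attention(q, k, v, weight_fn):
    """Toy single-head scaled dot-product attention with a pluggable weighting."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return weight_fn(scores) @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4, 8))          # 4 queries/keys/values of dimension 8
out_softmax = attention(q, k, v, softmax)
out_periodic = attention(q, k, v, periodic_attention_weights)
```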
Related papers
- Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts? [27.924615931679757]
We explore the impact of dense-to-sparse gating on maximum likelihood estimation under the Gaussian mixture of experts (MoE).
We propose a novel activation dense-to-sparse gate, which routes the output of a linear layer through an activation function before delivering it to the softmax.
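A minimal sketch of how such a gate could look, assuming a ReLU activation and a temperature-scaled softmax as stand-ins; the paper's exact activation and temperature schedule are not given in this summary, so everything below is illustrative.

```python
import numpy as np

def dense_to_sparse_gate(x, W, b, temperature=1.0):
    """Hypothetical dense-to-sparse MoE gate: a linear routing layer, an
    elementwise activation, then a temperature-scaled softmax over experts.
    ReLU and the temperature schedule are assumptions for illustration."""
    h = np.maximum(x @ W + b, 0.0)             # activation applied before the softmax
    z = (h - h.max(axis=-1, keepdims=True)) / temperature
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)   # routing probabilities over experts

# Example: route a batch of 2 inputs (dim 4) over 3 experts; lower temperature -> sparser gate.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W, b = rng.normal(size=(4, 3)), np.zeros(3)
gate_probs = dense_to_sparse_gate(x, W, b, temperature=0.5)
```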
arXiv Detail & Related papers (2024-01-25T01:09:09Z)
- Achieving Margin Maximization Exponentially Fast via Progressive Norm Rescaling [7.6730288475318815]
We investigate the margin-maximization bias of gradient-based algorithms in classifying linearly separable data.
We propose a novel algorithm called Progressive Rescaling Gradient Descent (PRGD) and show that PRGD can maximize the margin at an exponential rate.
PRGD also shows promise in enhancing the generalization performance when applied to linearly non-separable datasets and deep neural networks.
arXiv Detail & Related papers (2023-11-24T10:07:10Z)
- Variational Classification [51.2541371924591]
We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders.
Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency.
We induce a chosen latent distribution, instead of the implicit assumption found in a standard softmax layer.
arXiv Detail & Related papers (2023-05-17T17:47:19Z)
- Bridging Discrete and Backpropagation: Straight-Through and Beyond [62.46558842476455]
We propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables.
We propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs.
arXiv Detail & Related papers (2023-04-17T20:59:49Z)
- Convex Bounds on the Softmax Function with Applications to Robustness Verification [69.09991317119679]
The softmax function is a ubiquitous component at the output of neural networks and increasingly in intermediate layers as well.
This paper provides convex lower bounds and concave upper bounds on the softmax function, which are compatible with convex optimization formulations for characterizing neural networks and other ML models.
arXiv Detail & Related papers (2023-03-03T05:07:02Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Enhancing Classifier Conservativeness and Robustness by Polynomiality [23.099278014212146]
We show how polynomiality can remedy the situation.
A directly related, simple, yet important technical novelty we subsequently present is softRmax.
We show that two aspects of softRmax, conservativeness and inherent robustness, lead to adversarial regularization.
arXiv Detail & Related papers (2022-03-23T19:36:19Z)
- Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation [2.3813678058429626]
The softmax function is widely used in artificial neural networks for multiclass classification problems.
In this paper, we provide an empirical study of a simple and concise softmax variant, namely sparse-softmax, to alleviate the problems that traditional softmax encounters in high-dimensional classification.
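One plausible reading of such a variant, sketched under the assumption that sparse-softmax keeps only the k largest logits and renormalizes over them; the paper's exact definition may differ.

```python
import numpy as np

def sparse_softmax(logits, k=5):
    """Top-k sparse softmax sketch (an assumed reading of the variant above):
    normalize only over the k largest logits and set all other probabilities
    to exactly zero, which keeps the computation cheap in high dimensions."""
    logits = np.asarray(logits, dtype=float)
    top_k = np.argpartition(logits, logits.size - k)[-k:]   # indices of the k largest logits
    z = logits[top_k] - logits[top_k].max()                  # stabilize before exponentiating
    probs = np.zeros_like(logits)
    probs[top_k] = np.exp(z) / np.exp(z).sum()
    return probs

p = sparse_softmax(np.random.default_rng(0).normal(size=10_000), k=5)
assert abs(p.sum() - 1.0) < 1e-6 and np.count_nonzero(p) == 5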
arXiv Detail & Related papers (2021-12-23T09:53:38Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and that the induced cross-attention achieves better accuracy with respect to source-target word alignment.
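The core substitution is simple to sketch; the snippet below is a minimal reading of the summary that just swaps softmax for ReLU on the scaled dot-product scores, omitting any additional normalization used in the full ReLA model.

```python
import numpy as np

def rectified_linear_attention(q, k, v):
    """Minimal sketch of the ReLU-for-softmax swap described above: negative
    scores map to exact zeros, so the attention weights are naturally sparse.
    Any additional normalization used in the full ReLA model is omitted."""
    scores = q @ k.T / np.sqrt(q.shape[-1])     # scaled dot-product scores
    weights = np.maximum(scores, 0.0)           # ReLU instead of softmax
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4, 8))            # 4 positions, dimension 8
out = rectified_linear_attention(q, k, v)
```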
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
- Gaussian MRF Covariance Modeling for Efficient Black-Box Adversarial Attacks [86.88061841975482]
We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth-order oracle.
We use this setting to find fast one-step adversarial attacks, akin to a black-box version of the Fast Gradient Sign Method (FGSM).
We show that the method uses fewer queries and achieves higher attack success rates than the current state of the art.
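As a generic illustration of the zeroth-order, one-step setting only (not the paper's Gaussian MRF covariance modeling), the sketch below estimates the gradient from loss queries via random two-point finite differences and takes an FGSM-style signed step; all names and parameter values are assumptions.

```python
import numpy as np

def zeroth_order_fgsm(x, loss_fn, eps=0.03, delta=1e-3, n_dirs=100, seed=0):
    """Generic one-step black-box attack sketch (not the paper's method):
    estimate the gradient with two-point random finite differences using only
    loss-value queries, then take a signed FGSM-style step of size eps."""
    rng = np.random.default_rng(seed)
    grad_est = np.zeros_like(x)
    for _ in range(n_dirs):                                     # 2 * n_dirs loss queries
        u = rng.normal(size=x.shape)
        grad_est += (loss_fn(x + delta * u) - loss_fn(x - delta * u)) / (2 * delta) * u
    grad_est /= n_dirs
    return np.clip(x + eps * np.sign(grad_est), 0.0, 1.0)       # keep pixels in [0, 1]

# Toy example: the "loss" is squared distance to an all-ones image.
x0 = np.full((8, 8), 0.5)
x_adv = zeroth_order_fgsm(x0, loss_fn=lambda x: float(np.sum((x - 1.0) ** 2)))
```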
arXiv Detail & Related papers (2020-10-08T18:36:51Z)