Scalable-Softmax Is Superior for Attention
- URL: http://arxiv.org/abs/2501.19399v1
- Date: Fri, 31 Jan 2025 18:55:35 GMT
- Title: Scalable-Softmax Is Superior for Attention
- Authors: Ken M. Nakanishi
- Abstract summary: Transformer-based language models rely on Softmax to compute attention scores. SSMax replaces Softmax in scenarios where the input vector size varies. Models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to flatten as the context size grows. This reduces the model's ability to prioritize key information effectively and potentially limits its length generalization. To address this problem, we propose Scalable-Softmax (SSMax), which replaces Softmax in scenarios where the input vector size varies. SSMax can be seamlessly integrated into existing Transformer-based architectures. Experimental results in language modeling show that models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts and key information retrieval. Furthermore, an analysis of attention scores reveals that SSMax enables the model to focus attention on key information even in long contexts. Additionally, although models that use SSMax from the beginning of pretraining achieve better length generalization, those that have already started pretraining can still gain some of this ability by replacing Softmax in the attention layers with SSMax, either during or after pretraining.
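The abstract does not spell out the formula, but here is a minimal numerical sketch of the idea, assuming SSMax rescales the logits by s * log(n) before exponentiation, with n the input vector size and s a learnable scalar; the value s = 0.43 and the demo setup are illustrative assumptions, not the paper's experimental configuration.

```python
import numpy as np

def softmax(z):
    # Standard Softmax; its largest output shrinks as len(z) grows.
    e = np.exp(z - z.max())
    return e / e.sum()

def ssmax(z, s=0.43):
    # Scalable-Softmax sketch: scale the logits by s * log(n) before
    # exponentiating, where n is the input vector size and s would be a
    # learnable scalar in a real model (s = 0.43 here is illustrative).
    n = z.shape[-1]
    scaled = s * np.log(n) * z
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

# One "key" logit among many background logits: Softmax dilutes it as the
# context grows, while SSMax keeps attention concentrated on it.
for n in (16, 1024, 65536):
    z = np.zeros(n)
    z[0] = 5.0
    print(f"n={n:6d}  softmax_max={softmax(z)[0]:.4f}  ssmax_max={ssmax(z)[0]:.4f}")
```

As n grows, the largest Softmax output decays toward zero while the SSMax output stays concentrated on the key entry, which is the behavior the abstract describes.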
Related papers
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.
This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of the softmax operation.
Experiments demonstrate that SWAT outperforms state-of-the-art linear recurrent architectures on eight benchmarks.
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - Self-Adjust Softmax [62.267367768385434]
The softmax function is crucial in Transformer attention; it normalizes each row of the attention scores to sum to one.
We propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$.
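A minimal NumPy sketch of the two variants as written in the summary above; the row-wise axis handling and the epsilon guard for constant rows are assumptions, not details from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sa_softmax(x):
    # x * softmax(x), applied row-wise as for attention scores.
    return x * softmax(x)

def sa_softmax_normalized(x, eps=1e-9):
    # (x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x),
    # computed per row; the eps guard for constant rows is an assumption.
    lo = np.minimum(x.min(axis=-1, keepdims=True), 0.0)
    hi = np.maximum(x.max(axis=-1, keepdims=True), 0.0)
    return (x - lo) / (hi - lo + eps) * softmax(x)
```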
arXiv Detail & Related papers (2025-02-25T15:07:40Z) - MultiMax: Sparse and Multi-Modal Attention Learning [60.49318008131978]
SoftMax is a ubiquitous ingredient of modern machine learning algorithms. We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality. We propose MultiMax, which adaptively modulates the output distribution according to the input entry range.
arXiv Detail & Related papers (2024-06-03T10:51:43Z) - ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters [14.029865087214436]
The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks.
Achieving real-time LLM inference on silicon remains challenging due to the extensive use of Softmax in self-attention.
We propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient alternative to Softmax.
arXiv Detail & Related papers (2024-01-31T17:52:52Z) - Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture [0.0]
The Transformer neural network architecture uses a form of attention in which the dot product of query and key is divided by the square root of the key dimension before applying softmax.
We propose some alternative scalings, including dividing the dot product instead by the sum of the key lengths before applying softmax.
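A sketch of how such an alternative scaling could look next to the standard 1/sqrt(d_k) rule; interpreting "the sum of the key lengths" as the sum of the keys' Euclidean norms is an assumption based on this one-line summary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_sqrt_dk(Q, K, V):
    # Standard scaled dot-product attention: divide scores by sqrt(d_k).
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def attention_key_length_sum(Q, K, V):
    # Alternative scaling sketched from the summary above: divide the
    # scores by the sum of the key vectors' Euclidean lengths instead.
    return softmax(Q @ K.T / np.linalg.norm(K, axis=-1).sum()) @ V

# Toy shapes: 4 queries/keys/values of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(attention_sqrt_dk(Q, K, V).shape, attention_key_length_sum(Q, K, V).shape)
```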
arXiv Detail & Related papers (2023-11-15T22:10:42Z) - Zero-th Order Algorithm for Softmax Attention Optimization [21.631643446337737]
We present a Zero-th Order algorithm specifically tailored for Softmax optimization.
We demonstrate the convergence of our algorithm, highlighting its effectiveness in efficiently computing gradients for large-scale language models.
arXiv Detail & Related papers (2023-07-17T09:43:50Z) - r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax that outputs sparse probability distributions with a controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z) - Spectral Aware Softmax for Visible-Infrared Person Re-Identification [123.69049942659285]
Visible-infrared person re-identification (VI-ReID) aims to match specific pedestrian images from different modalities.
Existing methods still follow the softmax loss training paradigm, which is widely used in single-modality classification tasks.
We propose the spectral-aware softmax (SA-Softmax) loss, which can fully explore the embedding space with the modality information.
arXiv Detail & Related papers (2023-02-03T02:57:18Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing attempts to remove or approximate Softmax in attention are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation [2.3813678058429626]
The softmax function is widely used in artificial neural networks for multiclass classification problems.
In this paper, we provide an empirical study of a simple and concise softmax variant, namely sparse-softmax, to alleviate the problems that traditional softmax encounters in high-dimensional classification.
arXiv Detail & Related papers (2021-12-23T09:53:38Z) - Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining an FID of 12.19 without using the class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)