ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
- URL: http://arxiv.org/abs/2402.10930v2
- Date: Tue, 20 Feb 2024 09:52:42 GMT
- Title: ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters
- Authors: Shiwei Liu, Guanchen Tao, Yifei Zou, Derek Chow, Zichen Fan, Kauna
Lei, Bangfei Pan, Dennis Sylvester, Gregory Kielian, and Mehdi Saligane
- Abstract summary: The self-attention mechanism sets transformer-based large language models (LLMs) apart from convolutional and recurrent neural networks.
Achieving real-time LLM inference on silicon is challenging due to the extensively used Softmax in self-attention.
We propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient Softmax alternative.
- Score: 14.029865087214436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The self-attention mechanism sets transformer-based large language
models (LLMs) apart from convolutional and recurrent neural networks. Despite
the performance improvement, achieving real-time LLM inference on silicon is
challenging due to the extensively used Softmax in self-attention. Apart from
its non-linearity, the low arithmetic intensity greatly reduces processing
parallelism, which becomes the bottleneck especially when dealing with longer
contexts. To address this challenge, we propose Constant Softmax (ConSmax), a
software-hardware co-design that serves as an efficient Softmax alternative.
ConSmax employs differentiable normalization parameters to remove the maximum
search and denominator summation in Softmax, allowing massive parallelization
while still performing the critical tasks of Softmax. In addition, a scalable
ConSmax hardware design based on a bitwidth-split look-up table (LUT) produces
lossless non-linear operations and supports mixed-precision computing, further
facilitating efficient LLM inference. Experimental results show that ConSmax
achieves a minuscule power consumption of 0.43 mW and an area of 0.001 mm^2 at
a 1-GHz working frequency in 22-nm CMOS technology. Compared to
state-of-the-art Softmax hardware, ConSmax achieves 14.5x energy and 14.0x
area savings with comparable accuracy on a GPT-2 model and the WikiText103
dataset.
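
The abstract's core idea lends itself to a short sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it contrasts the two row-wise reductions that limit parallelism in standard Softmax with a ConSmax-style normalization whose learnable parameters (named `beta` and `gamma` here purely for illustration) stand in for the max search and the denominator sum, and it adds a toy bitwidth-split exp LUT to show why splitting a quantized operand into two smaller indices is lossless. The exact parameterization, bit widths, and fixed-point format in the paper may differ.

```python
import torch
import torch.nn as nn


def standard_softmax(scores: torch.Tensor) -> torch.Tensor:
    """Numerically stable Softmax: each row needs a max search and a sum reduction."""
    m = scores.max(dim=-1, keepdim=True).values     # reduction 1: maximum search
    e = torch.exp(scores - m)
    return e / e.sum(dim=-1, keepdim=True)          # reduction 2: denominator summation


class ConSmaxSketch(nn.Module):
    """ConSmax-style normalization (sketch): learnable offset and scale stand in
    for the row-wise max and denominator, so every score is handled independently."""

    def __init__(self) -> None:
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))    # stand-in for the max subtraction
        self.gamma = nn.Parameter(torch.ones(1))    # stand-in for the denominator sum

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # Purely element-wise: no cross-element dependency, hence massively parallel.
        return torch.exp(scores - self.beta) / self.gamma


def bitwidth_split_exp(q: torch.Tensor, step: float = 0.03125) -> torch.Tensor:
    """Toy bitwidth-split LUT: exp of an 8-bit fixed-point code q (value = q * step)
    is rebuilt from two 16-entry tables, using exp((hi*16 + lo) * step) =
    exp(hi*16*step) * exp(lo*step). Exact for the quantized input, i.e. lossless.
    The 8-bit format and the step size are illustrative, not the paper's design."""
    hi, lo = q // 16, q % 16
    table_hi = torch.exp(torch.arange(16, dtype=torch.float32) * 16 * step)
    table_lo = torch.exp(torch.arange(16, dtype=torch.float32) * step)
    return table_hi[hi] * table_lo[lo]


if __name__ == "__main__":
    scores = torch.randn(2, 4, 16)                      # toy (batch, rows, context) scores
    print(standard_softmax(scores).sum(-1))             # rows sum to 1 by construction
    print(ConSmaxSketch()(scores).shape)                # same shape; rows need not sum to 1

    q = torch.arange(256)                               # every 8-bit code
    print(torch.allclose(bitwidth_split_exp(q), torch.exp(q.float() * 0.03125)))  # True
```

In this sketch the output rows do not sum exactly to one; per the abstract, the differentiable parameters are meant to learn the normalizing role during training rather than compute it with per-row reductions at inference time.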
Related papers
- MultiMax: Sparse and Multi-Modal Attention Learning [60.49318008131978]
SoftMax is a ubiquitous ingredient of modern machine learning algorithms.
We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality.
We propose MultiMax, which adaptively modulates the output distribution according to the input entry range.
arXiv Detail & Related papers (2024-06-03T10:51:43Z)
- r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of softmax that outputs a sparse probability distribution with a controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z)
- Spectral Aware Softmax for Visible-Infrared Person Re-Identification [123.69049942659285]
Visible-infrared person re-identification (VI-ReID) aims to match specific pedestrian images from different modalities.
Existing methods still follow the softmax loss training paradigm, which is widely used in single-modality classification tasks.
We propose the spectral-aware softmax (SA-Softmax) loss, which can fully explore the embedding space with the modality information.
arXiv Detail & Related papers (2023-02-03T02:57:18Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation [2.3813678058429626]
The softmax function is widely used in artificial neural networks for multiclass classification problems.
In this paper, we provide an empirical study on a simple and concise softmax variant, namely sparse-softmax, to alleviate the problems that traditional softmax encounters in high-dimensional classification.
arXiv Detail & Related papers (2021-12-23T09:53:38Z)
- SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts to approximate self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
- Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs.
We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems.
Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z)
- Exploring Alternatives to Softmax Function [0.5924831288313849]
We investigate Taylor softmax, SM-softmax, and our proposed SM-Taylor softmax as alternatives to the softmax function.
Our experiments for the image classification task on different datasets reveal that there is always a configuration of the SM-Taylor softmax function that outperforms the normal softmax function.
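For orientation, here is a minimal sketch of the Taylor softmax referenced above, assuming the common second-order form that replaces exp(x) with 1 + x + x^2/2 (positive for every real x); the SM variants additionally apply a soft margin to the target-class logit, which is omitted here.

```python
import torch


def taylor_softmax(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Second-order Taylor expansion of exp(x): 1 + x + x^2/2 > 0 for all real x,
    # so normalizing it yields a valid probability distribution.
    t = 1.0 + x + 0.5 * x * x
    return t / t.sum(dim=dim, keepdim=True)
```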
arXiv Detail & Related papers (2020-11-23T16:50:18Z)
- Optimal Approximation -- Smoothness Tradeoffs for Soft-Max Functions [73.33961743410876]
A soft-max function has two main efficiency measures: approximation and smoothness.
We identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness.
This leads to novel soft-max functions, each of which is optimal for a different application.
arXiv Detail & Related papers (2020-10-22T05:19:58Z)
- Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining FID of 12.19 without using the class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)