SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
- URL: http://arxiv.org/abs/2510.17189v1
- Date: Mon, 20 Oct 2025 06:09:09 GMT
- Title: SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference
- Authors: Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Yongpan Liu
- Abstract summary: We present SOLE, a hardware-software co-design for Softmax and LayerNorm. We achieve both low-precision calculation and low bit-width storage on Softmax and LayerNorm.
- Score: 6.157559748568282
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have shown remarkable performance in both natural language processing (NLP) and computer vision (CV) tasks. However, their real-time inference speed and efficiency are limited by the inefficiency of Softmax and Layer Normalization (LayerNorm). Previous works based on function approximation suffer from inefficient implementations because they emphasize computation while disregarding memory overhead. Moreover, such methods rely on retraining to compensate for approximation error, which can be costly and inconvenient. In this paper, we present SOLE, a hardware-software co-design for Softmax and LayerNorm that is composed of E2Softmax and AILayerNorm. E2Softmax utilizes log2 quantization of the exponent function and log-based division to approximate Softmax, while AILayerNorm adopts low-precision statistic calculation. Compared with state-of-the-art designs, we achieve both low-precision calculation and low bit-width storage for Softmax and LayerNorm. Experiments show that SOLE maintains inference accuracy without retraining while offering orders-of-magnitude speedup and energy savings over GPU, achieving 3.04x and 3.86x energy-efficiency improvements and 2.82x and 3.32x area-efficiency improvements over prior state-of-the-art custom hardware for Softmax and LayerNorm, respectively.
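To make the two components concrete, below is a minimal NumPy sketch of the two ideas named in the abstract: log2-quantized exponents with a log-based division for Softmax, and low-precision statistics for LayerNorm. This is an illustration under assumed bit-widths; the function names (`e2softmax_like`, `ailayernorm_like`) and parameters (`frac_bits`, `stat_bits`) are my own and do not reproduce the paper's exact hardware algorithms.

```python
import numpy as np

def e2softmax_like(x, frac_bits=4):
    """Illustrative log2-domain softmax approximation (a sketch of the
    E2Softmax idea, NOT the paper's exact hardware algorithm)."""
    x = np.asarray(x, dtype=np.float64)
    scale = 1 << frac_bits
    # Max-subtraction for stability, then move to base-2 exponents.
    t = (x - x.max()) * np.log2(np.e)
    t_q = np.round(t * scale) / scale            # log2 quantization of exponents
    # Log-based "division": subtract (quantized) log2 of the denominator.
    log_den = np.log2(np.sum(np.exp2(t_q)))
    log_den_q = np.round(log_den * scale) / scale
    return np.exp2(t_q - log_den_q)

def ailayernorm_like(x, gamma, beta, stat_bits=8, eps=1e-5):
    """Illustrative LayerNorm with statistics computed from a low-precision
    copy of the input (a sketch of the AILayerNorm idea, assumed bit-width)."""
    x = np.asarray(x, dtype=np.float64)
    qmax = (1 << (stat_bits - 1)) - 1
    s = np.abs(x).max() + eps                    # per-vector quantization scale
    q = np.round(np.clip(x / s, -1.0, 1.0) * qmax)   # low-bit-width copy
    mean = q.mean() * s / qmax                   # statistics from the low-bit copy
    var = q.var() * (s / qmax) ** 2
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

if __name__ == "__main__":
    v = np.random.randn(8)
    ref = np.exp(v - v.max()) / np.exp(v - v.max()).sum()
    print("softmax approx error:", np.abs(e2softmax_like(v) - ref).max())
```

Because the attention probabilities and normalization statistics live in the quantized log2/integer domain in such a scheme, they can also be stored at low bit width, which is presumably where the memory savings emphasized in the abstract come from.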
Related papers
- Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon. We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z) - Self-Adjust Softmax [62.267367768385434]
The softmax function is crucial in Transformer attention, normalizing each row of the attention scores to sum to one. We propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$ (a minimal sketch of both variants appears after this list).
arXiv Detail & Related papers (2025-02-25T15:07:40Z) - AdaSplash: Adaptive Sparse Flash Attention [20.28859850361068]
We propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. AdaSplash achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations.
arXiv Detail & Related papers (2025-02-17T17:56:23Z) - SoftmAP: Software-Hardware Co-design for Integer-Only Softmax on Associative Processors [1.8999662338457695]
Non-linear operators like Softmax and LayerNorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware.
arXiv Detail & Related papers (2024-11-26T20:00:54Z) - ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters [14.029865087214436]
The self-attention mechanism distinguishes transformer-based large language models (LLMs) from convolutional and recurrent neural networks.
However, achieving real-time LLM inference on silicon remains challenging due to the extensive use of Softmax in self-attention.
We propose Constant Softmax (ConSmax), a software-hardware co-design that serves as an efficient alternative to Softmax.
arXiv Detail & Related papers (2024-01-31T17:52:52Z) - Spectral Aware Softmax for Visible-Infrared Person Re-Identification [123.69049942659285]
Visible-infrared person re-identification (VI-ReID) aims to match specific pedestrian images from different modalities.
Existing methods still follow the softmax loss training paradigm, which is widely used in single-modality classification tasks.
We propose the spectral-aware softmax (SA-Softmax) loss, which can fully explore the embedding space with the modality information.
arXiv Detail & Related papers (2023-02-03T02:57:18Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation [2.3813678058429626]
The softmax function is widely used in artificial neural networks for the multiclass classification problems.
In this paper, we provide an empirical study of a simple and concise softmax variant, namely sparse-softmax, to alleviate the problems that traditional softmax encounters in high-dimensional classification.
arXiv Detail & Related papers (2021-12-23T09:53:38Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts on approximating the self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Effectiveness of MPC-friendly Softmax Replacement [13.710300609457267]
We analyze the two uses of the softmax replacement and compare them to softmax.
We found that the replacement only provides a significant speed-up for a one-layer network while it always reduces accuracy, sometimes significantly.
arXiv Detail & Related papers (2020-11-23T04:14:32Z) - Optimal Approximation -- Smoothness Tradeoffs for Soft-Max Functions [73.33961743410876]
A soft-max function has two main efficiency measures: approximation and smoothness.
We identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness.
This leads to novel soft-max functions, each of which is optimal for a different application.
arXiv Detail & Related papers (2020-10-22T05:19:58Z)
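For reference, here is a minimal NumPy rendering of the two Self-Adjust Softmax variants quoted in the related-papers list above. It is sketched only from the formulas in that abstract; the function names are my own and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Standard numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sa_softmax(x, axis=-1):
    """x * softmax(x), as described in the Self-Adjust Softmax abstract."""
    return x * softmax(x, axis=axis)

def sa_softmax_normalized(x, axis=-1):
    """(x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x)."""
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    num = x - np.minimum(x_min, 0.0)
    den = np.maximum(0.0, x_max) - np.minimum(x_min, 0.0)
    # den is nonzero for typical attention scores; an all-zero row would
    # need an explicit guard.
    return num / den * softmax(x, axis=axis)
```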