r-softmax: Generalized Softmax with Controllable Sparsity Rate
- URL: http://arxiv.org/abs/2304.05243v3
- Date: Fri, 21 Apr 2023 14:41:43 GMT
- Title: r-softmax: Generalized Softmax with Controllable Sparsity Rate
- Authors: Klaudia Bałazy, Łukasz Struski, Marek Śmieja, Jacek Tabor
- Abstract summary: We propose r-softmax, a modification of the softmax that outputs a sparse probability distribution with a controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
- Score: 11.39524236962986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nowadays, artificial neural network models achieve remarkable results in many
disciplines. Functions mapping the representation provided by the model to a
probability distribution are an inseparable aspect of deep learning solutions.
Although softmax is a commonly accepted probability mapping function in the
machine learning community, it cannot return sparse outputs and always spreads
positive probability over all positions. In this paper, we propose r-softmax,
a modification of the softmax that outputs a sparse probability distribution
with a controllable sparsity rate. In contrast to existing sparse probability
mapping functions, we provide an intuitive mechanism for controlling the output
sparsity level. We show on several multi-label datasets that r-softmax
outperforms other sparse alternatives to softmax and is highly competitive with
the original softmax. We also apply r-softmax to the self-attention module of a
pre-trained transformer language model and demonstrate that it leads to
improved performance when fine-tuning the model on different natural language
processing tasks.
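As a rough illustration of what a softmax with a controllable sparsity rate can look like, the sketch below zeroes out the probabilities of the floor(r*n) smallest logits and renormalizes over the rest. This is a simplified interpretation for illustration only, not the paper's exact r-softmax formulation; the function name and the masking scheme are assumptions.

```python
import numpy as np

def sparse_softmax_with_rate(logits, r):
    """Softmax-like mapping where a fraction r of positions gets zero probability.

    The floor(r * n) smallest logits are masked out and the softmax is
    renormalized over the remaining entries. Simplified sketch, not the
    paper's exact r-softmax definition.
    """
    logits = np.asarray(logits, dtype=float)
    n = logits.size
    num_zeros = min(int(np.floor(r * n)), n - 1)      # always keep at least one entry
    keep = np.ones(n, dtype=bool)
    if num_zeros > 0:
        keep[np.argsort(logits)[:num_zeros]] = False  # drop the smallest logits
    shifted = logits - logits[keep].max()             # numerical stability
    probs = np.where(keep, np.exp(shifted), 0.0)
    return probs / probs.sum()

# r = 0 recovers the ordinary dense softmax; larger r zeroes out more positions.
print(sparse_softmax_with_rate([2.0, 1.0, 0.1, -1.0], r=0.0))
print(sparse_softmax_with_rate([2.0, 1.0, 0.1, -1.0], r=0.5))
```

With r = 0.5 on four logits, two positions receive exactly zero probability and the remaining mass is redistributed among the kept entries, which is the behavior the abstract describes for sparse outputs.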
Related papers
- MultiMax: Sparse and Multi-Modal Attention Learning [60.49318008131978]
SoftMax is a ubiquitous ingredient of modern machine learning algorithms.
We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality.
We propose MultiMax, which adaptively modulates the output distribution according to the range of the input entries.
arXiv Detail & Related papers (2024-06-03T10:51:43Z) - Binary Hypothesis Testing for Softmax Models and Leverage Score Models [8.06972158448711]
We consider the problem of binary hypothesis testing in the setting of softmax models.
We draw an analogy between the softmax model and the leverage score model.
arXiv Detail & Related papers (2024-05-09T15:56:29Z) - Revisiting Logistic-softmax Likelihood in Bayesian Meta-Learning for Few-Shot Classification [4.813254903898101]
The logistic-softmax is often employed as an alternative to the softmax likelihood in multi-class Gaussian process classification.
We revisit and redesign the logistic-softmax likelihood, which enables control of the a priori confidence level through a temperature parameter (a hedged sketch of this likelihood appears after this list).
Our approach yields well-calibrated uncertainty estimates and achieves comparable or superior results on standard benchmark datasets.
arXiv Detail & Related papers (2023-10-16T13:20:13Z) - Spectral Aware Softmax for Visible-Infrared Person Re-Identification [123.69049942659285]
Visible-infrared person re-identification (VI-ReID) aims to match specific pedestrian images from different modalities.
Existing methods still follow the softmax loss training paradigm, which is widely used in single-modality classification tasks.
We propose the spectral-aware softmax (SA-Softmax) loss, which can fully explore the embedding space with the modality information.
arXiv Detail & Related papers (2023-02-03T02:57:18Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation [2.3813678058429626]
The softmax function is widely used in artificial neural networks for multi-class classification problems.
In this paper, we provide an empirical study of a simple and concise softmax variant, namely sparse-softmax, to alleviate the problems that the traditional softmax encounters in high-dimensional classification.
arXiv Detail & Related papers (2021-12-23T09:53:38Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts to approximate self-attention with linear complexity have been made in natural language processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer or SOFT is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs.
We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems.
Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Gradient Estimation with Stochastic Softmax Tricks [84.68686389163153]
We introduce stochastic softmax tricks, which generalize the Gumbel-Softmax trick to combinatorial spaces.
We find that stochastic softmax tricks can be used to train latent variable models that perform better and discover more latent structure.
arXiv Detail & Related papers (2020-06-15T00:43:44Z)
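For the last entry above, the base trick being generalized is the standard categorical Gumbel-Softmax relaxation: add Gumbel(0, 1) noise to the logits and apply a temperature-scaled softmax to obtain a differentiable approximation of a one-hot sample. The sketch below shows only this basic categorical case, not the paper's extension to combinatorial spaces.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Draw one relaxed categorical sample via the Gumbel-Softmax trick."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / temperature
    y = y - y.max()                      # numerical stability
    probs = np.exp(y)
    return probs / probs.sum()

# Low temperatures give near-one-hot samples; higher ones give smoother vectors.
print(gumbel_softmax_sample([1.0, 0.5, -0.5], temperature=0.1))
```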
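For the "Revisiting Logistic-softmax" entry, the commonly used definition of the logistic-softmax likelihood normalizes logistic sigmoids rather than exponentials. The sketch below follows that definition and scales the logits by a temperature T, which is one plausible placement of the temperature parameter; the paper's exact parameterization may differ.

```python
import numpy as np

def logistic_softmax(logits, temperature=1.0):
    """Logistic-softmax: p_c = sigmoid(f_c / T) / sum_k sigmoid(f_k / T)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    sig = 1.0 / (1.0 + np.exp(-scaled))  # logistic sigmoid of each scaled logit
    return sig / sig.sum()

# Smaller temperatures sharpen the distribution, analogous to softmax temperature.
print(logistic_softmax([2.0, 0.0, -1.0], temperature=0.5))
```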
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.