Related papers: MultiMax: Sparse and Multi-Modal Attention Learning

URL: http://arxiv.org/abs/2406.01189v3
Date: Wed, 08 Jan 2025 07:59:53 GMT
Title: MultiMax: Sparse and Multi-Modal Attention Learning
Authors: Yuxuan Zhou, Mario Fritz, Margret Keuper
Abstract summary: SoftMax is a ubiquitous ingredient of modern machine learning algorithms. We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality. We propose MultiMax, which adaptively modulates the output distribution according to input entry range.
Score: 60.49318008131978
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: SoftMax is a ubiquitous ingredient of modern machine learning algorithms. It maps an input vector onto a probability simplex and reweights the input by concentrating the probability mass at large entries. Yet, as a smooth approximation to the Argmax function, a significant amount of probability mass is distributed to other, residual entries, leading to poor interpretability and noise. Although sparsity can be achieved by a family of SoftMax variants, they often require an alternative loss function and do not preserve multi-modality. We show that this trade-off between multi-modality and sparsity limits the expressivity of SoftMax as well as its variants. We provide a solution to this tension between objectives by proposing a piece-wise differentiable function, termed MultiMax, which adaptively modulates the output distribution according to input entry range. Through comprehensive analysis and evaluation, we show that MultiMax successfully produces a distribution that suppresses irrelevant entries while preserving multi-modality, with benefits in image classification, language modeling and machine translation. The code is available at https://github.com/ZhouYuxuanYX/MultiMax.
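
The trade-off described in the abstract can be illustrated with a small sketch. This is not the authors' MultiMax code; it is plain SoftMax with a temperature parameter, and the input scores are hypothetical values chosen for illustration. At temperature 1.0 the six residual entries still receive a noticeable share of the probability mass; lowering the temperature removes that mass but also collapses the second large entry, so the multi-modality is lost.

```python
import math

def softmax(xs, temperature=1.0):
    """Numerically stable SoftMax with a temperature parameter."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: two large (relevant) entries and six small (residual) ones.
scores = [3.0, 2.5] + [0.0] * 6

probs = softmax(scores)
residual_mass = sum(probs[2:])  # mass that leaks onto the residual entries

# A low temperature sparsifies the output but also suppresses the second mode.
sharp = softmax(scores, temperature=0.1)

print("residual mass at T=1.0:", residual_mass)
print("second mode at T=0.1:", sharp[1])
```

The temperature here only stands in for the general tension between sparsity and multi-modality; the paper's own function is available in the linked repository.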

Related papers

Self-Adjust Softmax [62.267367768385434]
The softmax function is crucial in Transformer attention, which normalizes each row of the attention scores with summation to one. We propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\operatorname{softmax}(x)$ to $x \cdot \operatorname{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \operatorname{softmax}(x)$.
arXiv Detail & Related papers (2025-02-25T15:07:40Z)
Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond [37.96043934146189]
We propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers. In GPT-2, our proposals are significantly better and more efficient than mixture of softmax. Our best method based on T5-Small improves the FactCC score by 2 points on the CNN/DM and XSUM datasets, and improves MAUVE scores by 30% on the BookSum paragraph-level dataset.
arXiv Detail & Related papers (2023-05-20T21:52:24Z)
r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax that outputs a sparse probability distribution with a controllable sparsity rate. We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z)
Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. Existing methods are either theoretically flawed or empirically ineffective for visual recognition. We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
Evidential Softmax for Sparse Multimodal Distributions in Deep Generative Models [38.26333732364642]
We present $\textit{ev-softmax}$, a sparse normalization function that preserves the multimodality of probability distributions. We evaluate our method on a variety of generative models, including variational autoencoders and auto-regressive architectures.
arXiv Detail & Related papers (2021-10-27T05:32:25Z)
SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. Various attempts at approximating the self-attention with linear complexity have been made in Natural Language Processing. We identify that their limitations are rooted in keeping the softmax self-attention during approximations. For the first time, a softmax-free transformer, or SOFT, is proposed.
arXiv Detail & Related papers (2021-10-22T17:57:29Z)
Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs. We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems. Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z)
Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
Argmax Flows and Multinomial Diffusion: Towards Non-Autoregressive Language Models [76.22217735434661]
This paper introduces two new classes of generative models for categorical data: Argmax Flows and Multinomial Diffusion. We demonstrate that our models perform competitively on language modelling and modelling of image segmentation maps.
arXiv Detail & Related papers (2021-02-10T11:04:17Z)
Effectiveness of MPC-friendly Softmax Replacement [13.710300609457267]
We analyze the two uses of the softmax replacement and compare them to softmax. We found that the replacement only provides a significant speed-up for a one-layer network while it always reduces accuracy, sometimes significantly.
arXiv Detail & Related papers (2020-11-23T04:14:32Z)
Optimal Approximation -- Smoothness Tradeoffs for Soft-Max Functions [73.33961743410876]
A soft-max function has two main efficiency measures: approximation and smoothness. We identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness. This leads to novel soft-max functions, each of which is optimal for a different application.
arXiv Detail & Related papers (2020-10-22T05:19:58Z)
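
The two formulas quoted in the Self-Adjust Softmax entry of this list can be sketched in a few lines of plain Python. This is an illustrative reading of the quoted expressions, not the SA-Softmax authors' implementation: the function names, the example scores, and the guard for an all-zero input are assumptions made here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sa_softmax(xs):
    """Elementwise x * softmax(x), as quoted in the entry."""
    return [x * p for x, p in zip(xs, softmax(xs))]

def sa_softmax_normalized(xs, eps=1e-12):
    """(x - min(x_min, 0)) / (max(0, x_max) - min(x_min, 0)) * softmax(x)."""
    lo = min(min(xs), 0.0)
    hi = max(0.0, max(xs))
    scale = hi - lo
    if scale < eps:
        # Assumption: return zeros when every input is zero, since the
        # quoted formula would otherwise divide by zero.
        return [0.0 for _ in xs]
    return [(x - lo) / scale * p for x, p in zip(xs, softmax(xs))]

# Hypothetical attention scores for one row.
scores = [2.0, -1.0, 0.5]
plain = sa_softmax(scores)
normalized = sa_softmax_normalized(scores)

print(plain)
print(normalized)
```

With these scores the unnormalized variant produces a negative weight for the negative input, while the normalized variant rescales the multiplier into the range 0 to 1. Neither output sums to one.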

This list is automatically generated from the titles and abstracts of the papers on this site.