Effectiveness of MPC-friendly Softmax Replacement
- URL: http://arxiv.org/abs/2011.11202v2
- Date: Tue, 6 Jul 2021 12:32:48 GMT
- Title: Effectiveness of MPC-friendly Softmax Replacement
- Authors: Marcel Keller and Ke Sun
- Abstract summary: We analyze the two uses of the softmax replacement and compare them to softmax.
We found that the replacement only provides a significant speed-up for a one-layer network while it always reduces accuracy, sometimes significantly.
- Score: 13.710300609457267
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Softmax is widely used in deep learning to map some representation to a
probability distribution. As it is based on exp/log functions that are
relatively expensive in multi-party computation, Mohassel and Zhang (2017)
proposed a simpler replacement based on ReLU to be used in secure computation.
However, we could not reproduce the accuracy they reported for training on
MNIST with three fully connected layers. Later works (e.g., Wagh et al., 2019
and 2021) used the softmax replacement not for computing the output probability
distribution but for approximating the gradient in back-propagation. In this
work, we analyze the two uses of the replacement and compare them to softmax,
both in terms of accuracy and cost in multi-party computation. We found that
the replacement only provides a significant speed-up for a one-layer network
while it always reduces accuracy, sometimes significantly. Thus we conclude
that its usefulness is limited and one should use the original softmax function
instead.
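As a concrete point of reference, below is a minimal plaintext NumPy sketch of the two functional forms being compared: standard softmax and the ReLU-based replacement, here taken to be ReLU(x_i) / sum_j ReLU(x_j) as described for Mohassel and Zhang (2017). It only illustrates the functions themselves, not the multi-party computation protocols used for the paper's cost measurements, and the uniform fallback for all-negative inputs is an illustrative choice, not taken from the paper.

```python
import numpy as np

def softmax(x):
    # Standard softmax: exp/log-based, relatively expensive under MPC.
    z = x - np.max(x)              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def relu_softmax_replacement(x):
    # MPC-friendly replacement: ReLU(x_i) / sum_j ReLU(x_j).
    # Only comparisons, additions and one division are needed,
    # which is cheaper than exp/log in secure computation.
    r = np.maximum(x, 0.0)
    s = r.sum()
    if s == 0.0:
        # Degenerate case (all logits negative): fall back to uniform.
        # Illustrative choice only, not specified by the paper.
        return np.full_like(x, 1.0 / x.size)
    return r / s

if __name__ == "__main__":
    logits = np.array([2.0, 1.0, -1.0, 0.5])
    print("softmax     :", softmax(logits))
    print("replacement :", relu_softmax_replacement(logits))
```

Note how the replacement assigns exactly zero probability to negative logits and scales linearly rather than exponentially with the inputs, which gives some intuition for why its training behaviour can differ from softmax.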
Related papers
- Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach.
Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z) - MultiMax: Sparse and Multi-Modal Attention Learning [60.49318008131978]
SoftMax is a ubiquitous ingredient of modern machine learning algorithms.
We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality.
We propose MultiMax, which adaptively modulates the output distribution according to the range of the input entries.
arXiv Detail & Related papers (2024-06-03T10:51:43Z) - Revisiting the Architectures like Pointer Networks to Efficiently
Improve the Next Word Distribution, Summarization Factuality, and Beyond [37.96043934146189]
We propose several softmax alternatives by simplifying the pointer networks and accelerating the word-by-word rerankers.
In GPT-2, our proposals are significantly better and more efficient than the mixture of softmaxes.
Our best method based on T5-Small improves the FactCC score by 2 points on the CNN/DM and XSUM datasets, and improves MAUVE scores by 30% on the BookSum paragraph-level dataset.
arXiv Detail & Related papers (2023-05-20T21:52:24Z) - Attention Scheme Inspired Softmax Regression [20.825033982038455]
Large language models (LLMs) have brought transformative changes to human society.
One of the key computations in LLMs is the softmax unit.
In this work, inspired by the softmax unit, we define a softmax regression problem.
arXiv Detail & Related papers (2023-04-20T15:50:35Z) - r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of softmax that outputs a sparse probability distribution with a controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - SOFT: Softmax-free Transformer with Linear Complexity [112.9754491864247]
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention.
Various attempts at approximating self-attention with linear complexity have been made in Natural Language Processing.
We identify that their limitations are rooted in keeping the softmax self-attention during approximations.
For the first time, a softmax-free transformer (SOFT) is proposed; a generic linear-attention sketch of the underlying idea follows this list.
arXiv Detail & Related papers (2021-10-22T17:57:29Z) - Provably Breaking the Quadratic Error Compounding Barrier in Imitation
Learning, Optimally [58.463668865380946]
We study the statistical limits of Imitation Learning in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$.
We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al. (2020).
We show the minimax suboptimality grows as $\Omega(H^{3/2}/N)$ when $|\mathcal{S}| \geq 3$, while the unknown-transition setting suffers from a larger sharp rate.
arXiv Detail & Related papers (2021-02-25T15:50:19Z) - Efficient semidefinite-programming-based inference for binary and
multi-class MRFs [83.09715052229782]
We propose an efficient method for computing the partition function or MAP estimate in a pairwise MRF.
We extend semidefinite relaxations from the typical binary MRF to the full multi-class setting, and develop a compact semidefinite relaxation that can again be solved efficiently using the solver.
arXiv Detail & Related papers (2020-12-04T15:36:29Z)
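The two SOFT entries above both revolve around removing the softmax from self-attention to reach linear complexity. As orientation only, here is a generic linear-attention sketch in that spirit: a positive feature map phi (elu(x) + 1, a common choice in the linear-attention literature) replaces the exponentiated dot product, so attention is computed without ever forming the n x n matrix. This is an assumed, generic variant for illustration, not SOFT's specific construction.

```python
import numpy as np

def elu(x, alpha=1.0):
    # Exponential linear unit, used to build a strictly positive feature map.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def linear_attention(Q, K, V):
    """Softmax-free attention with feature map phi(x) = elu(x) + 1.

    Instead of softmax(Q K^T) V, compute phi(Q) (phi(K)^T V) with a row-wise
    normaliser, so the n x n attention matrix is never materialised.
    """
    phi_q = elu(Q) + 1.0              # (n, d), strictly positive
    phi_k = elu(K) + 1.0              # (n, d), strictly positive
    kv = phi_k.T @ V                  # (d, d_v), aggregated once
    z = phi_q @ phi_k.sum(axis=0)     # (n,), per-query normaliser
    return (phi_q @ kv) / z[:, None]  # (n, d_v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, d_v = 6, 4, 3
    Q = rng.normal(size=(n, d))
    K = rng.normal(size=(n, d))
    V = rng.normal(size=(n, d_v))
    print(linear_attention(Q, K, V).shape)  # (6, 3)
```

Computing phi(K)^T V first costs O(n d d_v) rather than the O(n^2 d) of the full attention matrix, which is the source of the linear complexity these papers target.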