softmax is not enough (for sharp out-of-distribution)
- URL: http://arxiv.org/abs/2410.01104v2
- Date: Mon, 7 Oct 2024 13:13:41 GMT
- Title: softmax is not enough (for sharp out-of-distribution)
- Authors: Petar Veličković, Christos Perivolaropoulos, Federico Barbero, Razvan Pascanu
- Abstract summary: The softmax function is a key carrier of sharp behaviour in modern AI systems.
For tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time.
We propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
- Score: 16.167142726585357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
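To make the dispersion claim and the proposed fix concrete, here is a minimal sketch (NumPy, not the paper's experimental setup): with logits confined to a bounded range, the largest softmax coefficient decays toward uniform as the number of items grows, and lowering the temperature re-sharpens the output at inference time. The entropy-threshold rule below is an illustrative heuristic, not the paper's fitted adaptive-temperature rule.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperatures below 1 sharpen the output."""
    z = logits / temperature
    z = z - z.max()                       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)

# Dispersion: with bounded logits, the coefficient on the true maximum
# shrinks roughly like O(1/n) as the number of items n grows.
for n in [10, 100, 1_000, 10_000]:
    logits = rng.uniform(0.0, 1.0, size=n)
    print(n, softmax(logits).max())

# Ad-hoc sharpening at inference: lower the temperature until the output
# entropy falls below a target (illustrative heuristic only).
def sharpen(logits, target_entropy=0.5, temperatures=(1.0, 0.5, 0.25, 0.1)):
    for t in temperatures:
        p = softmax(logits, temperature=t)
        if -(p * np.log(p + 1e-12)).sum() <= target_entropy:
            return p
    return p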
Related papers
- Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach.
Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
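The inverted multi-index sampler itself is not reproduced here, but the sampled-softmax objective it accelerates can be sketched with a plain uniform negative sampler standing in for MIDX's adaptive proposal:

```python
import numpy as np

def sampled_softmax_loss(scores, target, num_samples, rng):
    """Cross-entropy over a huge class set, estimated by scoring only the
    target and a handful of sampled negatives. Uniform sampling stands in
    for MIDX's adaptive inverted-multi-index proposal; a non-uniform
    proposal q would subtract log q(c) from each sampled logit."""
    negatives = rng.choice(len(scores), size=num_samples, replace=False)
    negatives = negatives[negatives != target]
    logits = scores[np.concatenate(([target], negatives))]
    logits = logits - logits.max()                        # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))    # target at index 0

rng = np.random.default_rng(0)
scores = rng.normal(size=100_000)    # unnormalized scores over a large class set
print(sampled_softmax_loss(scores, target=42, num_samples=128, rng=rng))
```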
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
- MultiMax: Sparse and Multi-Modal Attention Learning [60.49318008131978]
SoftMax is a ubiquitous ingredient of modern machine learning algorithms.
We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality.
We propose MultiMax, which adaptively modulates the output distribution according to the input entry range.
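MultiMax's exact modulation scheme is not reproduced here, but sparsemax (Martins & Astudillo, 2016) is one well-known member of the family of sparse SoftMax variants the summary refers to:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of the logits onto the probability
    simplex; unlike softmax, it can assign exactly zero probability."""
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum       # entries that stay nonzero
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max     # simplex-projection threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([2.0, 1.0, -1.0])))  # [1. 0. 0.]: exact zeros
```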
arXiv Detail & Related papers (2024-06-03T10:51:43Z)
- Revisiting Logistic-softmax Likelihood in Bayesian Meta-Learning for Few-Shot Classification [4.813254903898101]
Logistic-softmax is often employed as an alternative to the softmax likelihood in multi-class Gaussian process classification.
We revisit and redesign the logistic-softmax likelihood, which enables control of the *a priori* confidence level through a temperature parameter.
Our approach yields well-calibrated uncertainty estimates and achieves comparable or superior results on standard benchmark datasets.
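A minimal sketch of a temperature-controlled logistic-softmax likelihood, assuming the common normalized-sigmoid form sigma(f_c / tau) / sum_k sigma(f_k / tau); the paper's redesigned variant may differ in detail:

```python
import numpy as np

def logistic_softmax(f, temperature=1.0):
    """Logistic-softmax likelihood: per-class sigmoids normalized across
    classes. The temperature parameter controls how sharply the likelihood
    responds to the latent function values f."""
    s = 1.0 / (1.0 + np.exp(-f / temperature))
    return s / s.sum()

f = np.array([1.5, 0.2, -0.8])
for tau in (2.0, 1.0, 0.25):
    print(tau, logistic_softmax(f, tau))
```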
arXiv Detail & Related papers (2023-10-16T13:20:13Z)
- r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax, outputting sparse probability distribution with controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
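r-softmax's exact construction is not reproduced here; as a hedged illustration of the idea of a controllable sparsity rate, the stand-in below zeroes out the bottom fraction r of the softmax output and renormalizes:

```python
import numpy as np

def sparse_softmax_with_rate(logits, r):
    """Illustrative stand-in for a controllable-sparsity softmax: keep the
    top (1 - r) fraction of entries and renormalize. r = 0 recovers the
    plain softmax; this is NOT r-softmax's exact construction."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    n_zero = int(r * len(p))
    if n_zero > 0:
        cutoff = np.sort(p)[n_zero - 1]
        p = np.where(p <= cutoff, 0.0, p)
        p /= p.sum()
    return p

print(sparse_softmax_with_rate(np.array([2.0, 1.0, 0.5, -1.0]), r=0.5))
```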
arXiv Detail & Related papers (2023-04-11T14:28:29Z)
- Revisiting Softmax for Uncertainty Approximation in Text Classification [45.07154956156555]
Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability.
One of the most widely used uncertainty approximation methods is Monte Carlo (MC) Dropout, which is computationally expensive.
We compare softmax and an efficient version of MC Dropout on their uncertainty approximations and downstream text classification performance.
We find that, while MC dropout produces the best uncertainty approximations, using a simple softmax leads to competitive and in some cases better uncertainty estimation for text classification at a much lower computational cost.
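A sketch of the two uncertainty estimates being compared, on a toy linear model rather than the paper's text classifiers: single-pass softmax predictive entropy versus Monte Carlo Dropout, which keeps dropout active at inference and averages predictions over several stochastic passes.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))     # toy linear classifier: 8 features, 3 classes
x = rng.normal(size=8)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

# Single-pass softmax uncertainty: one forward pass, predictive entropy.
print("softmax entropy:", entropy(softmax(x @ W)))

# MC Dropout: keep dropout active at inference, average over T passes.
T, keep = 100, 0.8
probs = [softmax((x * (rng.random(8) < keep) / keep) @ W) for _ in range(T)]
print("MC dropout entropy:", entropy(np.mean(probs, axis=0)))  # T passes vs one
```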
arXiv Detail & Related papers (2022-10-25T14:13:53Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing softmax-free attention methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
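SOFT's Gaussian-kernel construction with low-rank decomposition is not reproduced here; the sketch below shows the generic softmax-free (kernelized) linear attention pattern that motivates it, where replacing the exponentiated dot product with a decomposable feature map drops the cost from quadratic to linear in sequence length:

```python
import numpy as np

def linear_attention(Q, K, V, feature_map=lambda x: np.maximum(x, 0) + 1e-6):
    """Softmax-free attention: phi(Q) (phi(K)^T V) costs O(n d^2) instead of
    the O(n^2 d) of softmax attention. The positive ReLU feature map is a
    generic choice; SOFT itself uses a Gaussian kernel with a low-rank
    decomposition."""
    Qf, Kf = feature_map(Q), feature_map(K)   # (n, d) feature-mapped queries/keys
    KV = Kf.T @ V                             # (d, d) summary, independent of n
    norm = Qf @ Kf.sum(axis=0)                # per-row normalizer
    return (Qf @ KV) / norm[:, None]

rng = np.random.default_rng(0)
n, d = 1024, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)        # (1024, 64)
```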
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Optimal Approximation -- Smoothness Tradeoffs for Soft-Max Functions [73.33961743410876]
A soft-max function has two main efficiency measures: approximation and smoothness.
We identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness.
This leads to novel soft-max functions, each of which is optimal for a different application.
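A worked instance of the tradeoff: the log-sum-exp soft-max with parameter tau satisfies max(x) <= tau * logsumexp(x / tau) <= max(x) + tau * log n, so shrinking tau tightens the approximation while making the function less smooth (its gradient is softmax(x / tau), whose Lipschitz constant grows like 1 / tau). A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
n = len(x)

for tau in (1.0, 0.1, 0.01):
    # Stable evaluation of tau * logsumexp(x / tau).
    soft = x.max() + tau * np.log(np.exp((x - x.max()) / tau).sum())
    print(tau, soft - x.max(), tau * np.log(n))  # gap stays within tau * log n
```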
arXiv Detail & Related papers (2020-10-22T05:19:58Z)
- Loss Function Search for Face Recognition [75.79325080027908]
We develop a reward-guided search method to automatically obtain the best candidate loss function.
Experimental results on a variety of face recognition benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2020-07-10T03:40:10Z)
- Non-convex Min-Max Optimization: Applications, Challenges, and Recent Theoretical Advances [58.54078318403909]
The min-max problem, also known as the saddle point problem, is a class of adversarial optimization problems that is also studied in the context of zero-sum games.
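As a worked example of the saddle-point formulation, simultaneous gradient descent-ascent on the bilinear game f(x, y) = x * y (whose unique saddle point sits at the origin) illustrates why even simple min-max problems are delicate: the naive iterates spiral away rather than converge.

```python
import numpy as np

# min_x max_y f(x, y) = x * y; the unique saddle point is (0, 0).
x, y, lr = 1.0, 1.0, 0.1
for _ in range(100):
    gx, gy = y, x                      # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy    # descend in x, ascend in y
print(x, y, np.hypot(x, y))            # radius grows: simultaneous GDA diverges
```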
arXiv Detail & Related papers (2020-06-15T05:33:42Z)