Analytical Softmax Temperature Setting from Feature Dimensions for Model- and Domain-Robust Classification
- URL: http://arxiv.org/abs/2504.15594v1
- Date: Tue, 22 Apr 2025 05:14:38 GMT
- Title: Analytical Softmax Temperature Setting from Feature Dimensions for Model- and Domain-Robust Classification
- Authors: Tatsuhito Hasegawa, Shunsuke Sakai
- Abstract summary: In deep learning-based classification tasks, the temperature parameter $T$ critically influences the output distribution and overall performance. This study presents a novel theoretical insight that the optimal temperature $T^*$ is uniquely determined by the dimensionality of the feature representations. We develop an empirical formula to estimate $T^*$ without additional training while also introducing a corrective scheme to refine $T^*$ based on the number of classes and task complexity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In deep learning-based classification tasks, the softmax function's temperature parameter $T$ critically influences the output distribution and overall performance. This study presents a novel theoretical insight that the optimal temperature $T^*$ is uniquely determined by the dimensionality of the feature representations, thereby enabling training-free determination of $T^*$. Despite this theoretical grounding, empirical evidence reveals that $T^*$ fluctuates under practical conditions owing to variations in models, datasets, and other confounding factors. To address these influences, we propose and optimize a set of temperature determination coefficients that specify how $T^*$ should be adjusted based on the theoretical relationship to feature dimensionality. Additionally, we insert a batch normalization layer immediately before the output layer, effectively stabilizing the feature space. Building on these coefficients and a suite of large-scale experiments, we develop an empirical formula to estimate $T^*$ without additional training while also introducing a corrective scheme to refine $T^*$ based on the number of classes and task complexity. Our findings confirm that the derived temperature not only aligns with the proposed theoretical perspective but also generalizes effectively across diverse tasks, consistently enhancing classification performance and offering a practical, training-free solution for determining $T^*$.
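As a concrete illustration of the two ingredients named in the abstract (a batch normalization layer immediately before the output layer, and a softmax temperature tied to the feature dimensionality), here is a minimal PyTorch sketch. The rule `T = sqrt(feature_dim)` is a placeholder chosen for illustration; the paper's empirical formula, temperature determination coefficients, and class-count correction are not given in the abstract and are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaledHead(nn.Module):
    """Classifier head with BatchNorm before the output layer and a
    temperature derived from the feature dimensionality (illustrative)."""

    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        # Batch normalization immediately before the output layer (per the abstract).
        self.bn = nn.BatchNorm1d(feature_dim)
        self.fc = nn.Linear(feature_dim, num_classes)
        # Placeholder rule T = sqrt(d); the paper's empirical formula is not shown here.
        self.temperature = feature_dim ** 0.5

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        logits = self.fc(self.bn(features))
        return logits / self.temperature  # temperature-scaled logits

# Usage: standard cross-entropy on the temperature-scaled logits.
head = TemperatureScaledHead(feature_dim=512, num_classes=100)
features = torch.randn(32, 512)
targets = torch.randint(0, 100, (32,))
loss = F.cross_entropy(head(features), targets)
```

The only behavioral change from a standard head is that the logits entering the cross-entropy are divided by a fixed, dimension-derived temperature, so no additional training is needed to set it.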
Related papers
- Gradient-free stochastic optimization for additive models [56.42455605591779]
We address the problem of zero-order optimization from noisy observations for an objective function satisfying the Polyak-Łojasiewicz or the strong convexity condition. We assume that the objective function has an additive structure and satisfies a higher-order smoothness property, characterized by the Hölder family of functions.
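For context, the snippet below shows a textbook two-point randomized finite-difference (zeroth-order) gradient estimate built from noisy function evaluations only; it illustrates the general gradient-free setting and is not the estimator analyzed in this paper.

```python
import numpy as np

def two_point_zo_gradient(f, x, delta=1e-2, rng=None):
    """Two-point randomized finite-difference estimate of grad f(x)
    using (possibly noisy) function evaluations only."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(x.shape[0])
    u /= np.linalg.norm(u)  # random unit direction
    return x.shape[0] * (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u

# Example: gradient-free descent on a noisy quadratic objective.
rng = np.random.default_rng(0)
f = lambda x: float(np.sum(x ** 2) + 0.01 * rng.standard_normal())
x = np.ones(10)
for _ in range(500):
    x -= 0.05 * two_point_zo_gradient(f, x, rng=rng)
```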
arXiv Detail & Related papers (2025-03-03T23:39:08Z) - Exploring the Impact of Temperature Scaling in Softmax for Classification and Adversarial Robustness [8.934328206473456]
This study delves into the often-overlooked parameter within the softmax function, known as "temperature". Our empirical studies, adopting convolutional neural networks and transformers, reveal that moderate temperatures generally introduce better overall performance. For the first time, we discover a surprising benefit of elevated temperatures: enhanced model robustness against common corruption, natural perturbation, and non-targeted adversarial attacks like Projected Gradient Descent.
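A minimal sketch of the kind of comparison this suggests: train the same architecture with several softmax temperatures by dividing the logits inside the cross-entropy loss. `make_model` and `train_loader` are hypothetical placeholders; the paper's models, datasets, and corruption/attack benchmarks are not reproduced here.

```python
import torch
import torch.nn.functional as F

def train_with_temperature(model, loader, temperature, epochs=1, lr=1e-3, device="cpu"):
    """Train a classifier with temperature-scaled cross-entropy.

    The temperature enters the training loss only; it is this training-time
    choice whose effect on accuracy and robustness the paper studies.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            loss = F.cross_entropy(model(images) / temperature, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# Hypothetical sweep comparing moderate and elevated temperatures:
# for T in (1.0, 2.0, 5.0, 10.0):
#     model_T = train_with_temperature(make_model(), train_loader, temperature=T)
```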
arXiv Detail & Related papers (2025-02-28T00:07:45Z) - Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective. The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning. The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z) - Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts? [27.924615931679757]
We explore the impact of dense-to-sparse gating in a mixture of experts (MoE) on maximum likelihood estimation under the MoE.
We propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer through an activation function before delivering it to the softmax function.
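Read literally, such a gate could look like the following sketch: a linear projection is passed through an activation before a temperature-scaled softmax produces the expert weights. The GELU activation and the fixed temperature are illustrative assumptions rather than the paper's exact construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActivationDenseToSparseGate(nn.Module):
    """Gating network that passes a linear layer's output through an activation
    before the temperature-scaled softmax over experts."""

    def __init__(self, input_dim: int, num_experts: int, temperature: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(input_dim, num_experts)
        self.temperature = temperature  # lowering it sharpens (sparsifies) the gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate_logits = F.gelu(self.proj(x))  # activation applied before the softmax
        return F.softmax(gate_logits / self.temperature, dim=-1)

# gate = ActivationDenseToSparseGate(input_dim=64, num_experts=8, temperature=0.1)
# expert_weights = gate(torch.randn(4, 64))   # each row sums to 1
```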
arXiv Detail & Related papers (2024-01-25T01:09:09Z) - Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training [58.20089993899729]
This paper proposes TempBalance, a straightforward yet effective layerwise learning rate method.
We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization.
We also show that TempBalance outperforms a number of state-of-the-art metrics and schedulers.
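A hedged sketch of layer-wise learning rates via PyTorch parameter groups. The inverse-spectral-norm scaling below is only a placeholder; TempBalance derives its per-layer scaling from a spectral analysis of the weight matrices, and that exact metric is not reproduced here.

```python
import torch
import torch.nn as nn

def layerwise_lr_param_groups(model: nn.Module, base_lr: float = 0.1):
    """Per-layer parameter groups whose learning rates are scaled by a
    statistic of each layer's weight matrix.

    The inverse-spectral-norm rule is a placeholder, not TempBalance's metric.
    Only Linear/Conv layers are grouped in this simplified sketch.
    """
    groups = []
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            W = module.weight.detach().flatten(1)
            spec_norm = torch.linalg.matrix_norm(W, ord=2).item()
            groups.append({"params": list(module.parameters()),
                           "lr": base_lr / max(spec_norm, 1e-6)})
    return groups

# optimizer = torch.optim.SGD(layerwise_lr_param_groups(model), lr=0.1, momentum=0.9)
```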
arXiv Detail & Related papers (2023-12-01T05:38:17Z) - Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization [51.41175648612714]
We propose a new robust contrastive loss inspired by distributionally robust optimization (DRO).
We show that our algorithm automatically learns a suitable $\tau$ for each sample.
Our method outperforms prior strong baselines on unimodal and bimodal datasets.
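A minimal sketch of a contrastive loss with an individualized temperature: here each sample's $\tau$ is a free learnable parameter indexed by sample id, whereas the paper obtains individualized temperatures from its DRO formulation. `encoder`, `dataset`, and `sample_ids` in the usage comment are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerSampleTemperatureInfoNCE(nn.Module):
    """InfoNCE-style contrastive loss with an individual temperature per anchor."""

    def __init__(self, num_samples: int, init_tau: float = 0.1):
        super().__init__()
        # One learnable log-temperature per training sample (indexed by sample id).
        self.log_tau = nn.Parameter(torch.full((num_samples,), math.log(init_tau)))

    def forward(self, z1: torch.Tensor, z2: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        sim = z1 @ z2.t()                               # (B, B) cosine similarities
        tau = self.log_tau[ids].exp().clamp(min=1e-3)   # per-anchor temperature
        logits = sim / tau[:, None]
        targets = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, targets)         # positives on the diagonal

# loss_fn = PerSampleTemperatureInfoNCE(num_samples=len(dataset))
# loss = loss_fn(encoder(view1), encoder(view2), sample_ids)
```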
arXiv Detail & Related papers (2023-05-19T19:25:56Z) - Functional Linear Regression of Cumulative Distribution Functions [20.96177061945288]
We propose functional ridge-regression-based estimation methods that estimate CDFs accurately everywhere.
We show estimation error upper bounds of $\widetilde{O}(\sqrt{d}/n)$ for fixed design, random design, and adversarial context cases.
We formalize infinite dimensional models where the parameter space is an infinite dimensional Hilbert space, and establish a self-normalized estimation error upper bound for this setting.
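A small finite-dimensional illustration of ridge regression for CDF estimation: threshold indicators $1\{y \le t\}$ are regressed on covariates over a grid of thresholds. This sketch omits the paper's functional (infinite-dimensional Hilbert space) formulation and its error guarantees.

```python
import numpy as np

def fit_cdf_ridge(X, y, t_grid, lam=1.0):
    """Ridge regression of threshold indicators 1{y <= t} on covariates X,
    one regression per threshold in t_grid (finite-dimensional sketch)."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])                 # add intercept column
    Y = (y[:, None] <= t_grid[None, :]).astype(float)    # n x T indicator targets
    W = np.linalg.solve(Xb.T @ Xb + lam * np.eye(d + 1), Xb.T @ Y)
    return W                                             # (d + 1) x T coefficients

def predict_cdf(W, x_new):
    """Predicted CDF values on the threshold grid, projected to be monotone and in [0, 1]."""
    F_hat = np.append(x_new, 1.0) @ W
    return np.clip(np.maximum.accumulate(F_hat), 0.0, 1.0)
```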
arXiv Detail & Related papers (2022-05-28T23:59:50Z) - Addressing Maximization Bias in Reinforcement Learning with Two-Sample Testing [0.0]
Overestimation bias is a known threat to value-based reinforcement-learning algorithms.
We propose a $T$-Estimator (TE) based on two-sample testing for the mean that flexibly interpolates between over- and underestimation.
We also introduce a generalization, termed $K$-Estimator (KE), that obeys the same bias and variance bounds as the TE.
arXiv Detail & Related papers (2022-01-20T09:22:43Z) - Pseudo-Spherical Contrastive Divergence [119.28384561517292]
We propose pseudo-spherical contrastive divergence (PS-CD) to generalize maximum likelihood learning of energy-based models.
PS-CD avoids the intractable partition function and provides a generalized family of learning objectives.
arXiv Detail & Related papers (2021-11-01T09:17:15Z) - Self Normalizing Flows [65.73510214694987]
We propose a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer.
This reduces the computational complexity of each layer's exact update from $\mathcal{O}(D^3)$ to $\mathcal{O}(D^2)$.
We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts.
arXiv Detail & Related papers (2020-11-14T09:51:51Z) - Temperature check: theory and practice for training models with softmax-cross-entropy losses [21.073524360170833]
We develop a theory of early learning for models trained with softmax-cross-entropy loss.
We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude.
arXiv Detail & Related papers (2020-10-14T18:26:23Z)