Temperature check: theory and practice for training models with
softmax-cross-entropy losses
- URL: http://arxiv.org/abs/2010.07344v1
- Date: Wed, 14 Oct 2020 18:26:23 GMT
- Title: Temperature check: theory and practice for training models with
softmax-cross-entropy losses
- Authors: Atish Agarwala, Jeffrey Pennington, Yann Dauphin, Sam Schoenholz
- Abstract summary: We develop a theory of early learning for models trained with softmax-cross-entropy loss.
We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude.
- Score: 21.073524360170833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The softmax function combined with a cross-entropy loss is a principled
approach to modeling probability distributions that has become ubiquitous in
deep learning. The softmax function is defined by a lone hyperparameter, the
temperature, that is commonly set to one or regarded as a way to tune model
confidence after training; however, less is known about how the temperature
impacts training dynamics or generalization performance. In this work we
develop a theory of early learning for models trained with
softmax-cross-entropy loss and show that the learning dynamics depend crucially
on the inverse-temperature $\beta$ as well as the magnitude of the logits at
initialization, $||\beta{\bf z}||_{2}$. We follow up these analytic results
with a large-scale empirical study of a variety of model architectures trained
on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization
performance depends strongly on the temperature, but only weakly on the initial
logit magnitude. We provide evidence that the dependence of generalization on
$\beta$ is not due to changes in model confidence, but is a dynamical
phenomenon. It follows that the addition of $\beta$ as a tunable hyperparameter
is key to maximizing model performance. Although we find the optimal $\beta$ to
be sensitive to the architecture, our results suggest that tuning $\beta$ over
the range $10^{-2}$ to $10^1$ improves performance over all architectures
studied. We find that smaller $\beta$ may lead to better peak performance at
the cost of learning stability.
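To make the role of $\beta$ concrete, here is a minimal sketch of a softmax-cross-entropy loss with the inverse temperature exposed as a tunable hyperparameter. This is not code from the paper; the helper name `softmax_xent_with_beta` and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def softmax_xent_with_beta(logits, targets, beta=1.0):
    """Cross-entropy on inverse-temperature-scaled logits.

    The softmax is applied to beta * z, so beta = 1 recovers the usual
    loss and the effective initial logit magnitude is ||beta * z||_2.
    """
    return F.cross_entropy(beta * logits, targets)

# Toy usage: a freshly initialized linear model on random data, sweeping
# beta over roughly the range the abstract suggests tuning (1e-2 to 1e1).
torch.manual_seed(0)
x = torch.randn(128, 32)
y = torch.randint(0, 10, (128,))
model = torch.nn.Linear(32, 10)

for beta in [1e-2, 1e-1, 1.0, 1e1]:
    logits = model(x)
    loss = softmax_xent_with_beta(logits, y, beta=beta)
    logit_norm = (beta * logits).norm(dim=-1).mean()
    print(f"beta={beta:>5}: loss={loss.item():.3f}, "
          f"mean ||beta z||_2 at init={logit_norm.item():.3f}")
```

Scaling the logits by $\beta$ rescales their gradients as well, which is consistent with the abstract's point that the effect of $\beta$ is dynamical rather than a pure change in post-hoc model confidence.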
Related papers
- Analytical Softmax Temperature Setting from Feature Dimensions for Model- and Domain-Robust Classification [0.0]
In deep learning-based classification tasks, the temperature parameter $T$ critically influences the output distribution and overall performance.
This study presents a novel theoretical insight that the optimal temperature $T^{*}$ is uniquely determined by the dimensionality of the feature representations.
We develop an empirical formula to estimate $T^{*}$ without additional training while also introducing a corrective scheme to refine $T^{*}$ based on the number of classes and task complexity.
arXiv Detail & Related papers (2025-04-22T05:14:38Z)
- Exploring the Impact of Temperature Scaling in Softmax for Classification and Adversarial Robustness [8.934328206473456]
This study delves into the often-overlooked parameter within the softmax function, known as the "temperature".
Our empirical studies, using convolutional neural networks and transformers, reveal that moderate temperatures generally yield better overall performance.
For the first time, we discover a surprising benefit of elevated temperatures: enhanced model robustness against common corruptions, natural perturbations, and non-targeted adversarial attacks such as Projected Gradient Descent.
arXiv Detail & Related papers (2025-02-28T00:07:45Z)
- Gradient dynamics for low-rank fine-tuning beyond kernels [9.275532709125242]
We study low-rank fine-tuning in a student-teacher setting.
We prove under mild assumptions that a student model initialized at the base model and trained with online gradient descent will converge to the teacher.
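As a rough illustration of this student-teacher setting, here is a toy linear sketch (my own construction, not the paper's model): the teacher is the base model plus a low-rank update, and the student, initialized at the base model with a small low-rank factorization, is trained by online gradient descent on fresh samples.

```python
import torch

torch.manual_seed(0)
d, r, lr = 32, 2, 0.1

# Teacher = base model plus a rank-r perturbation; the student starts at the
# base model and learns a rank-r factorization U @ V.T of that perturbation.
base = torch.randn(d, d) / d ** 0.5
delta_star = torch.randn(d, r) @ torch.randn(r, d) / d   # teacher's low-rank update
U = (0.1 * torch.randn(d, r)).requires_grad_()
V = (0.1 * torch.randn(d, r)).requires_grad_()

for step in range(3000):
    x = torch.randn(8, d)                                 # fresh online samples
    teacher_out = x @ (base + delta_star)
    student_out = x @ (base + U @ V.T)
    loss = ((student_out - teacher_out) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        U -= lr * U.grad
        V -= lr * V.grad
        U.grad.zero_()
        V.grad.zero_()

# The recovered low-rank update should approach the teacher's.
print("fit error:", (U @ V.T - delta_star).norm().item(),
      "teacher norm:", delta_star.norm().item())
```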
arXiv Detail & Related papers (2024-11-23T00:00:28Z)
- The Optimization Landscape of SGD Across the Feature Learning Strength [102.1353410293931]
We study the effect of scaling the feature learning strength $\gamma$ across a variety of models and datasets in the online training setting.
We find that optimal online performance is often found at large $\gamma$.
Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
arXiv Detail & Related papers (2024-10-06T22:30:14Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
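For reference, here is a generic random feature model trained with gradient descent, in the spirit of (but not identical to) the solvable model studied: a frozen random projection with a trainable linear readout, where reusing a finite training set opens a gap between train and test error over time. All sizes and names are illustrative.

```python
import torch

torch.manual_seed(0)
n, d, n_features, lr, steps = 256, 16, 512, 1e-3, 2000

# Frozen random features; only the linear readout w is trained.
W = torch.randn(n_features, d) / d ** 0.5
x_train = torch.randn(n, d)
y_train = torch.sin(x_train[:, 0])                 # arbitrary scalar target
x_test = torch.randn(n, d)
y_test = torch.sin(x_test[:, 0])

def features(x):
    return torch.relu(x @ W.T)                     # random ReLU features

w = torch.zeros(n_features, requires_grad=True)

for step in range(steps):
    loss = ((features(x_train) @ w - y_train) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()

with torch.no_grad():
    test_mse = ((features(x_test) @ w - y_test) ** 2).mean()
print(f"train MSE: {loss.item():.4f}  test MSE: {test_mse.item():.4f}")
```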
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs).
We show that PanGu-$\pi$-7B can achieve performance comparable to that of benchmarks with about 10% inference speed-up.
In addition, we have deployed PanGu-$\pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z)
- Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training [58.20089993899729]
This paper proposes TempBalance, a straightforward yet effective layerwise learning rate method.
We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization.
We also show that TempBalance outperforms a number of state-of-the-art metrics and schedulers.
arXiv Detail & Related papers (2023-12-01T05:38:17Z)
- STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning [82.03481509373037]
Recently, model-based reinforcement learning algorithms have demonstrated remarkable efficacy in visual input environments.
We introduce the Stochastic Transformer-based wORld Model (STORM), an efficient world model architecture that combines strong modeling and generation capabilities.
STORM achieves a mean human performance of $126.7\%$ on the Atari $100$k benchmark, setting a new record among state-of-the-art methods.
arXiv Detail & Related papers (2023-10-14T16:42:02Z)
- Towards Alternative Techniques for Improving Adversarial Robustness: Analysis of Adversarial Training at a Spectrum of Perturbations [5.18694590238069]
Adversarial training (AT) and its variants have spearheaded progress in improving neural network robustness to adversarial perturbations.
We focus on models trained on a spectrum of $\epsilon$ values.
We identify alternative improvements to AT that otherwise wouldn't have been apparent at a single $\epsilon$.
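For context, below is a bare-bones PGD-based adversarial training step at one fixed perturbation budget; the paper's focus is on sweeping $\epsilon$ across a range rather than this standard loop, and the function names, step sizes, and iteration counts here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, steps):
    """L-infinity PGD: ascend the loss, projecting back into the eps-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y, eps):
    # Train on adversarial examples generated at perturbation budget eps;
    # sweeping eps over a range is the regime the paper analyzes.
    x_adv = pgd_attack(model, x, y, eps=eps, alpha=eps / 4, steps=10)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```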
arXiv Detail & Related papers (2022-06-13T22:01:21Z)
- Revisiting Model-based Value Expansion [35.55280687116388]
Model-based value expansion methods promise to improve the quality of value function targets and the effectiveness of value function learning.
However, to date, these methods are being outperformed by Dyna-style algorithms with conceptually simpler 1-step value function targets.
We provide a thorough empirical study to shed light on the causes of failure of value expansion methods in practice.
arXiv Detail & Related papers (2022-03-28T11:21:49Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters at constant computational cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models indicates that it is more effective for training larger models.
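A minimal sketch of the routing idea as described above, under the assumption of one gating head per prototype group (the per-token Python loop is for clarity only; module names and sizes are illustrative, not the paper's implementation): the experts are split into $k$ groups and each group independently applies top-$1$ routing, so every token is handled by $k$ experts at fixed cost.

```python
import torch

torch.manual_seed(0)
d_model, n_experts, k = 64, 16, 4          # 16 experts split into k = 4 prototype groups
group_size = n_experts // k

# One small feed-forward expert per slot, plus a gating head per group.
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, d_model), torch.nn.ReLU(),
                               torch.nn.Linear(d_model, d_model))
           for _ in range(n_experts)]
gates = [torch.nn.Linear(d_model, group_size) for _ in range(k)]

def expert_prototyping_forward(x):
    """k top-1 routing: each of the k groups selects one expert per token."""
    outputs = []
    for t in range(x.shape[0]):
        token_out = torch.zeros(d_model)
        for g in range(k):
            scores = torch.softmax(gates[g](x[t]), dim=-1)   # (group_size,)
            weight, idx = scores.max(dim=-1)                 # top-1 within group g
            expert = experts[g * group_size + int(idx)]
            token_out = token_out + weight * expert(x[t])
        outputs.append(token_out)
    return torch.stack(outputs)

tokens = torch.randn(8, d_model)            # a batch of 8 token representations
print(expert_prototyping_forward(tokens).shape)   # torch.Size([8, 64])
```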
arXiv Detail & Related papers (2021-05-31T16:12:44Z)