Temperature check: theory and practice for training models with
softmax-cross-entropy losses
- URL: http://arxiv.org/abs/2010.07344v1
- Date: Wed, 14 Oct 2020 18:26:23 GMT
- Title: Temperature check: theory and practice for training models with
softmax-cross-entropy losses
- Authors: Atish Agarwala, Jeffrey Pennington, Yann Dauphin, Sam Schoenholz
- Abstract summary: We develop a theory of early learning for models trained with softmax-cross-entropy loss.
We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude.
- Score: 21.073524360170833
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The softmax function combined with a cross-entropy loss is a principled
approach to modeling probability distributions that has become ubiquitous in
deep learning. The softmax function is defined by a lone hyperparameter, the
temperature, that is commonly set to one or regarded as a way to tune model
confidence after training; however, less is known about how the temperature
impacts training dynamics or generalization performance. In this work we
develop a theory of early learning for models trained with
softmax-cross-entropy loss and show that the learning dynamics depend crucially
on the inverse-temperature $\beta$ as well as the magnitude of the logits at
initialization, $||\beta{\bf z}||_{2}$. We follow up these analytic results
with a large-scale empirical study of a variety of model architectures trained
on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization
performance depends strongly on the temperature, but only weakly on the initial
logit magnitude. We provide evidence that the dependence of generalization on
$\beta$ is not due to changes in model confidence, but is a dynamical
phenomenon. It follows that the addition of $\beta$ as a tunable hyperparameter
is key to maximizing model performance. Although we find the optimal $\beta$ to
be sensitive to the architecture, our results suggest that tuning $\beta$ over
the range $10^{-2}$ to $10^1$ improves performance over all architectures
studied. We find that smaller $\beta$ may lead to better peak performance at
the cost of learning stability.
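To make the role of $\beta$ concrete, here is a minimal sketch of a softmax-cross-entropy loss with the inverse temperature exposed as a tunable hyperparameter. This is not code from the paper; the helper name `softmax_xent_with_beta` and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def softmax_xent_with_beta(logits, targets, beta=1.0):
    """Cross-entropy on inverse-temperature-scaled logits.

    The softmax is applied to beta * z, so beta = 1 recovers the usual
    loss and the effective initial logit magnitude is ||beta * z||_2.
    """
    return F.cross_entropy(beta * logits, targets)

# Toy usage: a freshly initialized linear model on random data, sweeping
# beta over roughly the range the abstract suggests tuning (1e-2 to 1e1).
torch.manual_seed(0)
x = torch.randn(128, 32)
y = torch.randint(0, 10, (128,))
model = torch.nn.Linear(32, 10)

for beta in [1e-2, 1e-1, 1.0, 1e1]:
    logits = model(x)
    loss = softmax_xent_with_beta(logits, y, beta=beta)
    logit_norm = (beta * logits).norm(dim=-1).mean()
    print(f"beta={beta:>5}: loss={loss.item():.3f}, "
          f"mean ||beta z||_2 at init={logit_norm.item():.3f}")
```

Scaling the logits by $\beta$ rescales their gradients as well, which is consistent with the abstract's point that the effect of $\beta$ is dynamical rather than a pure change in post-hoc model confidence.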
Related papers
- Analytical Softmax Temperature Setting from Feature Dimensions for Model- and Domain-Robust Classification [0.0]
In deep learning-based classification tasks, the temperature parameter $T$ critically influences the output distribution and overall performance.
This study presents a novel theoretical insight that the optimal temperature $T^{*}$ is uniquely determined by the dimensionality of the feature representations.
We develop an empirical formula to estimate $T^{*}$ without additional training while also introducing a corrective scheme to refine $T^{*}$ based on the number of classes and task complexity.
arXiv Detail & Related papers (2025-04-22T05:14:38Z)
- Exploring the Impact of Temperature Scaling in Softmax for Classification and Adversarial Robustness [8.934328206473456]
This study delves into the often-overlooked parameter within the softmax function, known as the "temperature".
Our empirical studies, using convolutional neural networks and transformers, reveal that moderate temperatures generally yield better overall performance.
For the first time, we discover a surprising benefit of elevated temperatures: enhanced model robustness against common corruptions, natural perturbations, and non-targeted adversarial attacks such as Projected Gradient Descent.
arXiv Detail & Related papers (2025-02-28T00:07:45Z)
- Gradient dynamics for low-rank fine-tuning beyond kernels [9.275532709125242]
We study low-rank fine-tuning in a student-teacher setting.
We prove under mild assumptions that a student model initialized at the base model and trained with online gradient descent will converge to the teacher.
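As a rough illustration of this student-teacher setting, here is a toy linear sketch (my own construction, not the paper's model): the teacher is the base model plus a low-rank update, and the student, initialized at the base model with a small low-rank factorization, is trained by online gradient descent on fresh samples.

```python
import torch

torch.manual_seed(0)
d, r, lr = 32, 2, 0.1

# Teacher = base model plus a rank-r perturbation; the student starts at the
# base model and learns a rank-r factorization U @ V.T of that perturbation.
base = torch.randn(d, d) / d ** 0.5
delta_star = torch.randn(d, r) @ torch.randn(r, d) / d   # teacher's low-rank update
U = (0.1 * torch.randn(d, r)).requires_grad_()
V = (0.1 * torch.randn(d, r)).requires_grad_()

for step in range(3000):
    x = torch.randn(8, d)                                 # fresh online samples
    teacher_out = x @ (base + delta_star)
    student_out = x @ (base + U @ V.T)
    loss = ((student_out - teacher_out) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        U -= lr * U.grad
        V -= lr * V.grad
        U.grad.zero_()
        V.grad.zero_()

# The recovered low-rank update should approach the teacher's.
print("fit error:", (U @ V.T - delta_star).norm().item(),
      "teacher norm:", delta_star.norm().item())
```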
arXiv Detail & Related papers (2024-11-23T00:00:28Z)
- The Optimization Landscape of SGD Across the Feature Learning Strength [102.1353410293931]
We study the effect of scaling the feature learning strength $\gamma$ across a variety of models and datasets in the online training setting.
We find that optimal online performance is often found at large $\gamma$.
Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
arXiv Detail & Related papers (2024-10-06T22:30:14Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
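For reference, here is a generic random feature model trained with gradient descent, in the spirit of (but not identical to) the solvable model studied: a frozen random projection with a trainable linear readout, where reusing a finite training set opens a gap between train and test error over time. All sizes and names are illustrative.

```python
import torch

torch.manual_seed(0)
n, d, n_features, lr, steps = 256, 16, 512, 1e-3, 2000

# Frozen random features; only the linear readout w is trained.
W = torch.randn(n_features, d) / d ** 0.5
x_train = torch.randn(n, d)
y_train = torch.sin(x_train[:, 0])                 # arbitrary scalar target
x_test = torch.randn(n, d)
y_test = torch.sin(x_test[:, 0])

def features(x):
    return torch.relu(x @ W.T)                     # random ReLU features

w = torch.zeros(n_features, requires_grad=True)

for step in range(steps):
    loss = ((features(x_train) @ w - y_train) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()

with torch.no_grad():
    test_mse = ((features(x_test) @ w - y_test) ** 2).mean()
print(f"train MSE: {loss.item():.4f}  test MSE: {test_mse.item():.4f}")
```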
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs).
We show that PanGu-$\pi$-7B can achieve performance comparable to that of benchmarks with about 10% inference speed-up.
In addition, we have deployed PanGu-$\pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z)
- Temperature Balancing, Layer-wise Weight Analysis, and Neural Network Training [58.20089993899729]
This paper proposes TempBalance, a straightforward yet effective layerwise learning rate method.
We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization.
We also show that TempBalance outperforms a number of state-of-the-art metrics and schedulers.
arXiv Detail & Related papers (2023-12-01T05:38:17Z)
- STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning [82.03481509373037]
Recently, model-based reinforcement learning algorithms have demonstrated remarkable efficacy in visual input environments.
We introduce the Stochastic Transformer-based wORld Model (STORM), an efficient world model architecture that combines strong modeling and generation capabilities.
STORM achieves a mean human performance of $126.7\%$ on the Atari $100$k benchmark, setting a new record among state-of-the-art methods.
arXiv Detail & Related papers (2023-10-14T16:42:02Z)
- Towards Alternative Techniques for Improving Adversarial Robustness: Analysis of Adversarial Training at a Spectrum of Perturbations [5.18694590238069]
Adversarial training (AT) and its variants have spearheaded progress in improving neural network robustness to adversarial perturbations.
We focus on models trained on a spectrum of $\epsilon$ values.
We identify alternative improvements to AT that otherwise wouldn't have been apparent at a single $\epsilon$.
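For context, below is a bare-bones PGD-based adversarial training step at one fixed perturbation budget; the paper's focus is on sweeping $\epsilon$ across a range rather than this standard loop, and the function names, step sizes, and iteration counts here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, steps):
    """L-infinity PGD: ascend the loss, projecting back into the eps-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y, eps):
    # Train on adversarial examples generated at perturbation budget eps;
    # sweeping eps over a range is the regime the paper analyzes.
    x_adv = pgd_attack(model, x, y, eps=eps, alpha=eps / 4, steps=10)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```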
arXiv Detail & Related papers (2022-06-13T22:01:21Z)
- Revisiting Model-based Value Expansion [35.55280687116388]
Model-based value expansion methods promise to improve the quality of value function targets and the effectiveness of value function learning.
However, to date, these methods are being outperformed by Dyna-style algorithms with conceptually simpler 1-step value function targets.
We provide a thorough empirical study to shed light on the causes of failure of value expansion methods in practice.
arXiv Detail & Related papers (2022-03-28T11:21:49Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters at constant computational cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models indicates that it is more effective for training larger models.
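A minimal sketch of the routing idea as described above, under the assumption of one gating head per prototype group (the per-token Python loop is for clarity only; module names and sizes are illustrative, not the paper's implementation): the experts are split into $k$ groups and each group independently applies top-$1$ routing, so every token is handled by $k$ experts at fixed cost.

```python
import torch

torch.manual_seed(0)
d_model, n_experts, k = 64, 16, 4          # 16 experts split into k = 4 prototype groups
group_size = n_experts // k

# One small feed-forward expert per slot, plus a gating head per group.
experts = [torch.nn.Sequential(torch.nn.Linear(d_model, d_model), torch.nn.ReLU(),
                               torch.nn.Linear(d_model, d_model))
           for _ in range(n_experts)]
gates = [torch.nn.Linear(d_model, group_size) for _ in range(k)]

def expert_prototyping_forward(x):
    """k top-1 routing: each of the k groups selects one expert per token."""
    outputs = []
    for t in range(x.shape[0]):
        token_out = torch.zeros(d_model)
        for g in range(k):
            scores = torch.softmax(gates[g](x[t]), dim=-1)   # (group_size,)
            weight, idx = scores.max(dim=-1)                 # top-1 within group g
            expert = experts[g * group_size + int(idx)]
            token_out = token_out + weight * expert(x[t])
        outputs.append(token_out)
    return torch.stack(outputs)

tokens = torch.randn(8, d_model)            # a batch of 8 token representations
print(expert_prototyping_forward(tokens).shape)   # torch.Size([8, 64])
```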
arXiv Detail & Related papers (2021-05-31T16:12:44Z)