Cut your Losses with Squentropy
- URL: http://arxiv.org/abs/2302.03952v1
- Date: Wed, 8 Feb 2023 09:21:13 GMT
- Title: Cut your Losses with Squentropy
- Authors: Like Hui, Mikhail Belkin, Stephen Wright
- Abstract summary: We propose the "squentropy" loss, which is the sum of two terms: the cross-entropy loss and the average square loss over the incorrect classes.
We show that the squentropy loss outperforms both the pure cross entropy and rescaled square losses in terms of the classification accuracy.
- Score: 19.924900110707284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nearly all practical neural models for classification are trained using
cross-entropy loss. Yet this ubiquitous choice is supported by little
theoretical or empirical evidence. Recent work (Hui & Belkin, 2020) suggests
that training using the (rescaled) square loss is often superior in terms of
the classification accuracy. In this paper we propose the "squentropy" loss,
which is the sum of two terms: the cross-entropy loss and the average square
loss over the incorrect classes. We provide an extensive set of experiments on
multi-class classification problems showing that the squentropy loss
outperforms both the pure cross entropy and rescaled square losses in terms of
the classification accuracy. We also demonstrate that it provides significantly
better model calibration than either of these alternative losses and,
furthermore, has less variance with respect to the random initialization.
Additionally, in contrast to the square loss, squentropy loss can typically be
trained using exactly the same optimization parameters, including the learning
rate, as the standard cross-entropy loss, making it a true "plug-and-play"
replacement. Finally, unlike the rescaled square loss, multiclass squentropy
contains no parameters that need to be adjusted.
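For concreteness, the following is a minimal sketch of the squentropy loss as the abstract describes it, written in PyTorch. It assumes the square term is applied to the raw logits of the incorrect classes (driving them toward zero) and is averaged over the C - 1 incorrect classes; the function name and interface are illustrative, not the authors' reference implementation.

```python
# Minimal sketch of the squentropy loss described above (not the authors' code).
# Assumption: the square term penalizes the raw logits of the incorrect classes,
# pushing them toward zero, averaged over the C - 1 incorrect classes.
import torch
import torch.nn.functional as F

def squentropy_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (batch, num_classes) raw outputs; targets: (batch,) class indices."""
    num_classes = logits.shape[1]
    # Standard cross-entropy term (mean over the batch).
    ce = F.cross_entropy(logits, targets)
    # Zero out the correct-class logit so only the incorrect classes contribute.
    correct_mask = F.one_hot(targets, num_classes).bool()
    wrong_logits = logits.masked_fill(correct_mask, 0.0)
    # Average square loss over the incorrect classes, then mean over the batch.
    sq = wrong_logits.pow(2).sum(dim=1) / (num_classes - 1)
    return ce + sq.mean()
```

Because the extra term introduces no tunable coefficients, such a function could in principle be dropped in wherever cross-entropy is used, with the same optimizer settings and learning rate, which is the "plug-and-play" property the abstract emphasizes.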
Related papers
- A unified law of robustness for Bregman divergence losses [2.014089835498735]
We show that Bregman divergence losses form a common generalization of square loss and cross-entropy loss.
Our generalization relies on identifying a bias-variance type decomposition that lies at the heart of the proof of Bubeck and Sellke.
arXiv Detail & Related papers (2024-05-26T17:30:44Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Test-Time Adaptation via Conjugate Pseudo-labels [21.005027151753477]
Test-time adaptation (TTA) refers to adapting neural networks to distribution shifts.
Prior TTA methods optimize over unsupervised objectives such as the entropy of model predictions in TENT.
We present a surprising phenomenon: if we attempt to meta-learn the best possible TTA loss over a wide class of functions, then we recover a function that is remarkably similar to (a temperature-scaled version of) the softmax-entropy employed by TENT.
arXiv Detail & Related papers (2022-07-20T04:02:19Z)
- Understanding Square Loss in Training Overparametrized Neural Network Classifiers [31.319145959402462]
We contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks.
We consider two cases, according to whether classes are separable or not. In the general non-separable case, a fast convergence rate is established for both the misclassification rate and the calibration error.
The resulting margin is proven to be bounded away from zero, providing theoretical guarantees of robustness.
arXiv Detail & Related papers (2021-12-07T12:12:30Z)
- Mixing between the Cross Entropy and the Expectation Loss Terms [89.30385901335323]
Cross-entropy loss tends to focus on hard-to-classify samples during training.
We show that adding the expectation loss to the optimization objective helps the network achieve better accuracy (an illustrative sketch of one such mixing appears after this list).
Our experiments show that the new training protocol improves performance across a diverse set of classification domains.
arXiv Detail & Related papers (2021-09-12T23:14:06Z)
- Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
arXiv Detail & Related papers (2021-07-21T16:41:57Z)
- Optimization Variance: Exploring Generalization Properties of DNNs [83.78477167211315]
The test error of a deep neural network (DNN) often demonstrates double descent.
We propose a novel metric, optimization variance (OV), to measure the diversity of model updates.
arXiv Detail & Related papers (2021-06-03T09:34:17Z)
- Shaping Deep Feature Space towards Gaussian Mixture for Visual Classification [74.48695037007306]
We propose a Gaussian mixture (GM) loss function for deep neural networks in visual classification.
With a classification margin and a likelihood regularization, the GM loss facilitates both high classification performance and accurate modeling of the feature distribution.
The proposed model can be implemented easily and efficiently without using extra trainable parameters.
arXiv Detail & Related papers (2020-11-18T03:32:27Z)
- MTAdam: Automatic Balancing of Multiple Training Loss Terms [95.99508450208813]
We generalize the Adam optimization algorithm to handle multiple loss terms.
We show that training with the new method leads to fast recovery from suboptimal initial loss weighting.
arXiv Detail & Related papers (2020-06-25T20:27:27Z)
- Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks [23.538629997497747]
Cross-entropy loss is widely believed to be empirically superior to the square loss for classification tasks.
We show that these neural architectures perform comparably or better when trained with the square loss.
Cross-entropy appears to have a slight edge on computer vision tasks.
arXiv Detail & Related papers (2020-06-12T17:00:49Z)
- Classification vs regression in overparameterized regimes: Does the loss function matter? [21.75115239010008]
We show that the minimum-norm least-squares solutions typically used for regression are identical to those produced by the hard-margin support vector machine (SVM).
Our results demonstrate the very different roles and properties of loss functions used at the training phase (optimization) and at the testing phase (generalization).
arXiv Detail & Related papers (2020-05-16T17:58:25Z)
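As referenced in the entry on mixing the cross-entropy and expectation loss terms above, the sketch below shows one plausible reading of that combination. It assumes the expectation loss is one minus the predicted probability of the correct class (the expected 0-1 loss under the model's softmax) and uses a hypothetical mixing weight `alpha`; neither assumption is taken from that paper.

```python
# Hypothetical sketch of mixing cross-entropy with an "expectation" loss term
# (assumed here to be 1 - p_correct, the expected 0-1 loss under the softmax).
# The mixing weight `alpha` is illustrative, not a value from the cited paper.
import torch
import torch.nn.functional as F

def mixed_ce_expectation_loss(logits: torch.Tensor,
                              targets: torch.Tensor,
                              alpha: float = 0.5) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=1)
    p_correct = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    expectation = (1.0 - p_correct).mean()  # expected 0-1 loss under the model
    return (1.0 - alpha) * ce + alpha * expectation
```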
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences of its use.