Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It
- URL: http://arxiv.org/abs/2403.14715v3
- Date: Thu, 20 Feb 2025 15:02:44 GMT
- Title: Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It
- Authors: Guoxuan Xia, Olivier Laurent, Gianni Franchi, Christos-Savvas Bouganis
- Abstract summary: Label smoothing (LS) is a popular regularisation method for training neural networks.
LS degrades the uncertainty rank ordering of correct vs incorrect predictions.
We provide an explanation for this behaviour by analysing logit-level gradients.
- Score: 6.19039575840278
- Abstract: Label smoothing (LS) is a popular regularisation method for training neural networks as it is effective in improving test accuracy and is simple to implement. "Hard" one-hot labels are "smoothed" by uniformly distributing probability mass to other classes, reducing overfitting. Prior work has suggested that in some cases LS can degrade selective classification (SC) -- where the aim is to reject misclassifications using a model's uncertainty. In this work, we first demonstrate empirically across an extended range of large-scale tasks and architectures that LS consistently degrades SC. We then address a gap in existing knowledge, providing an explanation for this behaviour by analysing logit-level gradients: LS degrades the uncertainty rank ordering of correct vs incorrect predictions by suppressing the max logit more when a prediction is likely to be correct, and less when it is likely to be wrong. This elucidates previously reported experimental results where strong classifiers underperform in SC. We then demonstrate the empirical effectiveness of post-hoc logit normalisation for recovering lost SC performance caused by LS. Furthermore, linking back to our gradient analysis, we again provide an explanation for why such normalisation is effective.
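To make the paper's fix concrete, here is a minimal NumPy sketch of selective classification scored with and without post-hoc logit normalisation. The synthetic logits, the choice of L2 norm, and the acceptance threshold are illustrative assumptions; the paper treats the normalisation details (such as the norm order) as tunable rather than fixed.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp(logits):
    # Maximum softmax probability: the standard selective-classification score.
    return softmax(logits).max(axis=-1)

def msp_logitnorm(logits, p=2, eps=1e-12):
    # Post-hoc logit normalisation: rescale each logit vector by its p-norm
    # before scoring; p = 2 here is an illustrative choice.
    norms = np.linalg.norm(logits, ord=p, axis=-1, keepdims=True)
    return softmax(logits / (norms + eps)).max(axis=-1)

def selective_risk(score, correct, threshold):
    # Accept predictions scoring above the threshold; report coverage
    # (fraction accepted) and risk (error rate among the accepted).
    accept = score > threshold
    if not accept.any():
        return 0.0, 0.0
    return float(accept.mean()), float(1.0 - correct[accept].mean())

# Synthetic stand-ins for a trained model's outputs and the true labels.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
correct = logits.argmax(axis=-1) == rng.integers(0, 10, size=1000)
print(selective_risk(msp(logits), correct, threshold=0.15))
print(selective_risk(msp_logitnorm(logits), correct, threshold=0.15))
```

With a real LS-trained model, one would substitute its logits for the synthetic ones and sweep the threshold to trace a risk-coverage curve.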
Related papers
- Adaptive Label Smoothing for Out-of-Distribution Detection [1.5999407512883508]
We propose a novel regularization method called adaptive label smoothing (ALS).
ALS pushes the non-true classes to have the same probability, while the maximal probability is neither fixed nor limited (a sketch of such targets follows below).
Our code will be available to the public.
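For contrast with the uniform smoothing described in the main abstract above, the sketch below constructs such targets; passing the true-class mass per sample is a hypothetical stand-in for what ALS actually learns.

```python
import numpy as np

def smoothed_targets(label, num_classes, true_mass):
    # Uniform LS fixes true_mass (e.g. 1 - epsilon) for every sample; the
    # ALS idea summarised above keeps the non-true classes uniform while
    # leaving the true-class mass neither fixed nor capped.
    targets = np.full(num_classes, (1.0 - true_mass) / (num_classes - 1))
    targets[label] = true_mass
    return targets

print(smoothed_targets(label=2, num_classes=5, true_mass=0.9))   # LS-like target
print(smoothed_targets(label=2, num_classes=5, true_mass=0.99))  # sharper target
```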
arXiv Detail & Related papers (2024-10-08T15:35:11Z) - Do not trust what you trust: Miscalibration in Semi-supervised Learning [21.20806568508201]
State-of-the-art semi-supervised learning (SSL) approaches rely on highly confident predictions to serve as pseudo-labels that guide the training on unlabeled samples.
We show that SSL methods based on pseudo-labels are significantly miscalibrated, and formally demonstrate that this stems from minimizing the min-entropy, a lower bound of the Shannon entropy.
We integrate a simple penalty term, which encourages the logits of predictions on unlabeled samples to remain low, preventing the network's predictions from becoming overconfident (a hypothetical rendering follows below).
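A hypothetical rendering of such a penalty; the squared-logit form and the weight `lam` are assumptions rather than the paper's exact formulation.

```python
import numpy as np

def penalised_loss(supervised_loss, unlabeled_logits, lam=0.1):
    # Keeping unlabeled logits small caps how peaked the softmax on those
    # samples can become, discouraging overconfident pseudo-labels.
    penalty = float(np.mean(unlabeled_logits ** 2))
    return supervised_loss + lam * penalty

print(penalised_loss(0.42, np.array([[3.0, -1.0, 0.5]])))
```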
arXiv Detail & Related papers (2024-03-22T18:43:46Z) - Model Calibration in Dense Classification with Adaptive Label Perturbation [44.62722402349157]
Existing dense binary classification models are prone to being over-confident.
We propose Adaptive Stochastic Label Perturbation (ASLP), which learns a unique label perturbation level for each training image.
ASLP can significantly improve the calibration of dense binary classification models on both in-distribution and out-of-distribution data.
arXiv Detail & Related papers (2023-07-25T14:40:11Z) - When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice (see the sketch below).
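A minimal sketch of such a deferral rule, assuming each stage exposes class probabilities; the two-stage setup, the stand-in classifiers, and the threshold are illustrative.

```python
import numpy as np

def cascade_predict(x, models, thresholds):
    # Confidence-based deferral: stop at the first model whose maximum
    # softmax probability clears its threshold; otherwise defer to the
    # next (typically larger) model. The final model always answers.
    for model, tau in zip(models[:-1], thresholds):
        probs = model(x)
        if probs.max() > tau:
            return int(probs.argmax())
    return int(models[-1](x).argmax())

small = lambda x: np.array([0.7, 0.2, 0.1])  # stand-in early classifier
large = lambda x: np.array([0.4, 0.5, 0.1])  # stand-in late classifier
print(cascade_predict(None, [small, large], thresholds=[0.8]))  # defers to large
```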
arXiv Detail & Related papers (2023-07-06T04:13:57Z) - Taming Overconfident Prediction on Unlabeled Data from Hindsight [50.9088560433925]
Minimizing prediction uncertainty on unlabeled data is a key factor to achieve good performance in semi-supervised learning.
This paper proposes a dual mechanism, named ADaptive Sharpening (ADS), which first applies a soft-threshold to adaptively mask out determinate and negligible predictions.
ADS significantly improves state-of-the-art SSL methods when used as a plug-in (a hypothetical sketch follows below).
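A hypothetical sketch of the idea as summarised above; the thresholds and temperature here are assumptions, and the real ADS sets its threshold adaptively.

```python
import numpy as np

def adaptive_sharpen(probs, low=0.2, high=0.95, temperature=0.5):
    # Leave already-determinate predictions (max prob >= high) and
    # negligible ones (max prob <= low) untouched; sharpen only the rest
    # with a temperature-scaled renormalisation.
    conf = probs.max(axis=-1)
    mask = (conf > low) & (conf < high)
    sharpened = probs ** (1.0 / temperature)
    sharpened /= sharpened.sum(axis=-1, keepdims=True)
    return np.where(mask[:, None], sharpened, probs)

probs = np.array([[0.5, 0.3, 0.2], [0.98, 0.01, 0.01]])
print(adaptive_sharpen(probs))  # only the first (uncertain) row is sharpened
```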
arXiv Detail & Related papers (2021-12-15T15:17:02Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the Importance-Guided Stochastic Gradient Descent (IGSGD) method to train models to infer directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z) - Re-Assessing the "Classify and Count" Quantification Method [88.60021378715636]
"Classify and Count" (CC) is often a biased estimator.
Previous works have failed to use properly optimised versions of CC.
We argue that properly optimised versions of CC, while still inferior to some cutting-edge methods, deliver near-state-of-the-art accuracy (see the sketch below).
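For reference, plain CC and one textbook bias correction (the "adjusted" variant, here for the binary case) can be sketched as follows; this illustrates the method family, not this paper's specific optimised variants.

```python
import numpy as np

def classify_and_count(predictions, num_classes):
    # Plain CC: estimate class prevalences by counting hard predictions.
    return np.bincount(predictions, minlength=num_classes) / len(predictions)

def adjusted_cc(pos_rate, tpr, fpr):
    # Adjusted CC for the binary case: invert
    # observed_rate = tpr * p + fpr * (1 - p) to recover the prevalence p.
    return float(np.clip((pos_rate - fpr) / (tpr - fpr), 0.0, 1.0))

preds = np.array([0, 1, 1, 0, 1, 1])
print(classify_and_count(preds, num_classes=2))        # raw prevalence estimate
print(adjusted_cc(pos_rate=4 / 6, tpr=0.9, fpr=0.2))   # bias-corrected estimate
```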
arXiv Detail & Related papers (2020-11-04T21:47:39Z) - Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more than the unbiasedness of the risk estimator in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.