MaxSup: Overcoming Representation Collapse in Label Smoothing
- URL: http://arxiv.org/abs/2502.15798v1
- Date: Tue, 18 Feb 2025 20:10:34 GMT
- Title: MaxSup: Overcoming Representation Collapse in Label Smoothing
- Authors: Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Mario Fritz, Margret Keuper
- Abstract summary: Label Smoothing (LS) is widely adopted to curb overconfidence in neural network predictions and enhance generalization. Previous research shows that LS can force feature representations into excessively tight clusters, eroding intra-class distinctions. We propose Max Suppression (MaxSup), which uniformly applies the intended regularization to both correct and incorrect predictions.
- Score: 55.067663157622384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Label Smoothing (LS) is widely adopted to curb overconfidence in neural network predictions and enhance generalization. However, previous research shows that LS can force feature representations into excessively tight clusters, eroding intra-class distinctions. More recent findings suggest that LS also induces overconfidence in misclassifications, yet the precise mechanism remained unclear. In this work, we decompose the loss term introduced by LS, revealing two key components: (i) a regularization term that functions only when the prediction is correct, and (ii) an error-enhancement term that emerges under misclassifications. This latter term compels the model to reinforce incorrect predictions with exaggerated certainty, further collapsing the feature space. To address these issues, we propose Max Suppression (MaxSup), which uniformly applies the intended regularization to both correct and incorrect predictions by penalizing the top-1 logit instead of the ground-truth logit. Through feature analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Extensive experiments on image classification and downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization.
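The abstract pins down the core substitution: Label Smoothing's extra loss term involves the ground-truth logit, while MaxSup penalizes the top-1 logit. Below is a minimal PyTorch sketch of that substitution, assuming `alpha` plays the role of the smoothing coefficient; it illustrates the abstract's description and is not the authors' released implementation (see the linked repository for the official code):

```python
import torch
import torch.nn.functional as F

def maxsup_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(logits, targets)
    # Label Smoothing's extra term is roughly alpha * (z_gt - mean(z)), which
    # turns error-enhancing under misclassification; MaxSup penalizes the
    # top-1 logit instead, so the penalty acts uniformly on correct and
    # incorrect predictions.
    top1 = logits.max(dim=1).values
    penalty = (top1 - logits.mean(dim=1)).mean()
    return ce + alpha * penalty
```

Replacing `top1` with the ground-truth logit (`logits.gather(1, targets.unsqueeze(1))`) recovers the Label Smoothing term whose error-enhancement behaviour the decomposition exposes.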
Related papers
- Decision from Suboptimal Classifiers: Excess Risk Pre- and Post-Calibration [52.70324949884702]
We quantify the excess risk incurred using approximate posterior probabilities in batch binary decision-making.
We identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss.
In NLP experiments, we show that these quantities identify when the expected gain from more advanced post-training is worth the operational cost.
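For context, a minimal sketch of the kind of post-hoc recalibration the paper analyzes, here using scikit-learn's isotonic regression on synthetic held-out scores (the data and setup are illustrative, not the paper's experiments):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic held-out scores and binary labels, for illustration only.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(0.35 + 0.3 * labels + 0.2 * rng.standard_normal(1000), 0.0, 1.0)

# Fit a monotone map from raw scores to calibrated probabilities.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(scores, labels)
calibrated = calibrator.predict(scores)
```

Because such a map depends only on the score, it can reduce calibration error but cannot touch the grouping loss, which is why the decomposition distinguishes the two regimes.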
arXiv Detail & Related papers (2025-03-23T10:52:36Z)
- Improved Feature Generating Framework for Transductive Zero-shot Learning [31.656888766677664]
Feature Generative Adversarial Networks have emerged as powerful generative models for producing high-quality representations of unseen classes.
This paper examines the pivotal influence of unseen-class priors within the framework of Transductive Zero-shot Learning (TZSL).
We introduce our Improved Feature Generation Framework, termed I-VAEGAN, which incorporates two novel components: Pseudo-conditional Feature Adversarial (PFA) learning and Variational Embedding Regression (VER).
arXiv Detail & Related papers (2024-12-24T08:42:16Z)
- Predicting Emergent Capabilities by Finetuning [98.9684114851891]
We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models.
We validate this approach using four standard NLP benchmarks.
We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
arXiv Detail & Related papers (2024-11-25T01:48:09Z)
- Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It [6.19039575840278]
Label smoothing (LS) is a popular regularisation method for training neural networks.
LS degrades the uncertainty rank ordering of correct vs. incorrect predictions.
We provide an explanation for this behaviour by analysing logit-level gradients.
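A generic way to quantify that rank ordering (not the authors' code): score each prediction by its max-softmax confidence and measure, e.g. with AUROC, how well confidence separates correct from incorrect predictions:

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

def confidence_rank_auroc(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # Max-softmax confidence and predicted class per sample.
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)
    # AUROC of confidence as a score for "prediction is correct":
    # the rank ordering that label smoothing reportedly degrades.
    correct = (pred == targets).long()
    return float(roc_auc_score(correct.numpy(), conf.detach().numpy()))
```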
arXiv Detail & Related papers (2024-03-19T06:46:24Z)
- Classification under Nuisance Parameters and Generalized Label Shift in Likelihood-Free Inference [3.507509142413452]
We propose a new method for robust uncertainty quantification that casts classification as a hypothesis testing problem under nuisance parameters.
Our method effectively endows a pre-trained classifier with domain adaptation capabilities and returns valid prediction sets while maintaining high power.
We demonstrate its performance on two challenging scientific problems in biology and astroparticle physics with data from realistic mechanistic models.
arXiv Detail & Related papers (2024-02-08T00:12:18Z)
- When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
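The rule itself is simple; a sketch for a single input, assuming `models` is ordered from cheapest to most expensive and `tau` is an illustrative confidence threshold:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cascade_predict(models, x: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    # Try cheaper models first; defer to the next one whenever the
    # max-softmax confidence falls below the threshold.
    for model in models[:-1]:
        probs = F.softmax(model(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= tau:  # confident enough: terminate here
            return pred
    # The last model in the cascade always answers.
    return models[-1](x).argmax(dim=-1)
```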
arXiv Detail & Related papers (2023-07-06T04:13:57Z)
- Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z)
- Taming Overconfident Prediction on Unlabeled Data from Hindsight [50.9088560433925]
Minimizing prediction uncertainty on unlabeled data is a key factor in achieving good performance in semi-supervised learning.
This paper proposes a dual mechanism, named ADaptive Sharpening (ADS), which first applies a soft threshold to adaptively mask out determinate and negligible predictions, then sharpens the informative predictions that remain.
As a plug-in, ADS significantly improves state-of-the-art SSL methods.
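The summary leaves the soft-threshold unspecified; a rough sketch of the two-step shape of such a mechanism (mask, then sharpen), where every threshold and the temperature are illustrative assumptions rather than the paper's settings:

```python
import torch

def mask_then_sharpen(probs: torch.Tensor, tau: float = 0.95,
                      eps: float = 0.05, T: float = 0.5) -> torch.Tensor:
    # Step 1: leave already-determinate samples (confidence >= tau) as-is
    # and zero out negligible per-class probabilities (below eps).
    conf = probs.max(dim=1).values
    informative = conf < tau
    masked = torch.where(probs < eps, torch.zeros_like(probs), probs)
    # Step 2: sharpen the remaining distributions by temperature scaling.
    sharpened = masked.pow(1.0 / T)
    sharpened = sharpened / sharpened.sum(dim=1, keepdim=True).clamp_min(1e-12)
    return torch.where(informative.unsqueeze(1), sharpened, probs)
```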
arXiv Detail & Related papers (2021-12-15T15:17:02Z)
- The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration [21.63888208442176]
Despite the dominant performance of deep neural networks, recent works have shown that they are poorly calibrated.
We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses.
We propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances.
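A compact sketch of a penalty in that spirit, turning the margin on logit distances into a hinge term added to cross-entropy (constants illustrative; see the paper for the exact constrained formulation):

```python
import torch
import torch.nn.functional as F

def margin_penalized_ce(logits: torch.Tensor, targets: torch.Tensor,
                        margin: float = 10.0, lam: float = 0.1) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets)
    # Distance of every logit to the maximum logit; only distances exceeding
    # the margin are penalized, an inequality constraint rather than the
    # equality that uniform label smoothing implicitly imposes.
    dist = logits.max(dim=1, keepdim=True).values - logits
    penalty = F.relu(dist - margin).sum(dim=1).mean()
    return ce + lam * penalty
```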
arXiv Detail & Related papers (2021-11-30T14:21:47Z)
- Second-Moment Loss: A Novel Regression Objective for Improved Uncertainties [7.766663822644739]
Quantification of uncertainty is one of the most promising approaches to establish safe machine learning.
One of the most commonly used approaches so far is Monte Carlo dropout, which is computationally cheap and easy to apply in practice.
We propose a new objective, referred to as the second-moment loss, to improve the quality of these uncertainty estimates.
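A heavily simplified sketch of a second-moment-style objective: alongside the usual mean fit, an auxiliary head is regressed onto the squared residual, i.e. the second moment of the error (a generic rendering of the idea, not necessarily the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def second_moment_style_loss(mean_pred: torch.Tensor, sq_pred: torch.Tensor,
                             y: torch.Tensor) -> torch.Tensor:
    # Fit the mean as usual; detach the residual target so the auxiliary
    # head learns the error's second moment without distorting the mean fit.
    residual_sq = (y - mean_pred).detach().pow(2)
    return F.mse_loss(mean_pred, y) + F.mse_loss(sq_pred, residual_sq)
```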
arXiv Detail & Related papers (2020-12-23T14:17:33Z)