MaxSup: Overcoming Representation Collapse in Label Smoothing
- URL: http://arxiv.org/abs/2502.15798v2
- Date: Mon, 02 Jun 2025 17:13:24 GMT
- Title: MaxSup: Overcoming Representation Collapse in Label Smoothing
- Authors: Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Yifei Dong, Mario Fritz, Margret Keuper
- Abstract summary: Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions. LS compacts feature representations into overly tight clusters, diluting intra-class diversity. We propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions.
- Score: 52.66247931969715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Label Smoothing (LS) is widely adopted to reduce overconfidence in neural network predictions and improve generalization. Despite these benefits, recent studies reveal two critical issues with LS. First, LS induces overconfidence in misclassified samples. Second, it compacts feature representations into overly tight clusters, diluting intra-class diversity, although the precise cause of this phenomenon remained elusive. In this paper, we analytically decompose the LS-induced loss, exposing two key terms: (i) a regularization term that dampens overconfidence only when the prediction is correct, and (ii) an error-amplification term that arises under misclassifications. This latter term compels the network to reinforce incorrect predictions with undue certainty, exacerbating representation collapse. To address these shortcomings, we propose Max Suppression (MaxSup), which applies uniform regularization to both correct and incorrect predictions by penalizing the top-1 logit rather than the ground-truth logit. Through extensive feature-space analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Experiments on large-scale image classification and multiple downstream tasks confirm that MaxSup is a more robust alternative to LS, consistently reducing overconfidence while preserving richer feature representations. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization
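The core change is small enough to sketch directly. The PyTorch-style snippet below is a minimal illustration of the idea in the abstract, assuming the usual decomposition of the smoothed loss into a cross-entropy term plus a logit regularizer; the coefficient `alpha` and the exact scaling are illustrative, and the linked repository contains the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def maxsup_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Sketch of Max Suppression (MaxSup).

    Label Smoothing effectively penalizes the ground-truth logit, which
    misfires on misclassified samples; MaxSup penalizes the top-1 logit
    instead, so the regularization acts uniformly on correct and
    incorrect predictions. `alpha` mirrors the smoothing coefficient.
    """
    ce = F.cross_entropy(logits, targets)   # standard classification term
    z_max = logits.max(dim=1).values        # top-1 logit per sample
    z_mean = logits.mean(dim=1)             # mean logit per sample
    reg = (z_max - z_mean).mean()           # suppress the dominant logit
    return ce + alpha * reg

# Usage: loss = maxsup_loss(model(x), y, alpha=0.1)
```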
Related papers
- Sharp Trade-Offs in High-Dimensional Inference via 2-Level SLOPE [20.580487867158364]
We show that 2-level SLOPE offers a robust, scalable alternative to both LASSO and general SLOPE.
arXiv Detail & Related papers (2025-07-12T01:57:10Z) - Disentangling Doubt in Deep Causal AI [0.0]
We propose a factorized Monte Carlo Dropout framework for deep twin-network models that splits total predictive variance into representation uncertainty and prediction uncertainty. Across three co-shift regimes, our intervals are well-calibrated and satisfy $\sigma_{\mathrm{rep}}^2 + \sigma_{\mathrm{pred}}^2 \approx \sigma_{\mathrm{tot}}^2$. This module-level decomposition offers a practical diagnostic for detecting and interpreting uncertainty sources in deep causal-effect models.
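As a toy illustration of this decomposition (not the paper's implementation), one can estimate each variance term with Monte Carlo Dropout by enabling dropout only in the encoder, only in the prediction head, or in both; the architecture and pass count below are assumptions.

```python
import torch
import torch.nn as nn

# Toy two-block regressor: an encoder and a prediction head, each with
# its own dropout layer (an illustrative stand-in for the factorized
# twin-network model in the paper).
encoder = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.2))
head = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Dropout(0.2), nn.Linear(16, 1))

def mc_variance(x: torch.Tensor, enc_mc: bool, head_mc: bool, T: int = 500) -> float:
    # Enable dropout only in the chosen block(s), then estimate the
    # predictive variance over T stochastic forward passes.
    encoder.train(enc_mc)
    head.train(head_mc)
    with torch.no_grad():
        ys = torch.stack([head(encoder(x)) for _ in range(T)])
    return ys.var(dim=0).mean().item()

x = torch.randn(16, 8)
var_rep = mc_variance(x, enc_mc=True, head_mc=False)   # representation term
var_pred = mc_variance(x, enc_mc=False, head_mc=True)  # prediction term
var_tot = mc_variance(x, enc_mc=True, head_mc=True)    # total
print(var_rep, var_pred, var_tot)  # roughly additive when the sources are independent
```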
arXiv Detail & Related papers (2025-07-04T14:48:51Z) - Decision from Suboptimal Classifiers: Excess Risk Pre- and Post-Calibration [52.70324949884702]
We quantify the excess risk incurred using approximate posterior probabilities in batch binary decision-making.
We identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss.
On NLP experiments, we show that these quantities identify when the expected gain of more advanced post-training is worth the operational cost.
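For context, a standard instance of the recalibration step discussed here is temperature scaling; the sketch below is that generic procedure (not this paper's contribution), fitting a single scalar temperature on held-out logits.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor,
                    iters: int = 200, lr: float = 0.01) -> float:
    """Generic temperature scaling: learn one scalar T > 0 that
    minimizes the NLL of softmax(logits / T) on a validation split."""
    log_t = torch.zeros(1, requires_grad=True)   # parameterize T = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# At test time: calibrated_probs = F.softmax(test_logits / T, dim=1)
```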
arXiv Detail & Related papers (2025-03-23T10:52:36Z) - Improved Feature Generating Framework for Transductive Zero-shot Learning [31.656888766677664]
Feature Generative Adversarial Networks have emerged as powerful generative models in producing high-quality representations of unseen classes.
This paper delves into the pivotal influence of unseen class priors within the framework of Transductive Zero-shot Learning (TZSL).
We introduce our Improved Feature Generation Framework, termed I-VAEGAN, which incorporates two novel components: Pseudo-conditional Feature Adversarial (PFA) learning and Variational Embedding Regression (VER).
arXiv Detail & Related papers (2024-12-24T08:42:16Z) - Predicting Emergent Capabilities by Finetuning [98.9684114851891]
We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models.
We validate this approach using four standard NLP benchmarks.
We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
arXiv Detail & Related papers (2024-11-25T01:48:09Z) - Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It [6.19039575840278]
Label smoothing (LS) is a popular regularisation method for training neural networks. LS degrades the uncertainty rank ordering of correct vs. incorrect predictions. We provide an explanation for this behaviour by analysing logit-level gradients.
arXiv Detail & Related papers (2024-03-19T06:46:24Z) - Classification under Nuisance Parameters and Generalized Label Shift in Likelihood-Free Inference [3.507509142413452]
We propose a new method for robust uncertainty quantification that casts classification as a hypothesis testing problem under nuisance parameters.
Our method effectively endows a pre-trained classifier with domain adaptation capabilities and returns valid prediction sets while maintaining high power.
We demonstrate its performance on two challenging scientific problems in biology and astroparticle physics with data from realistic mechanistic models.
arXiv Detail & Related papers (2024-02-08T00:12:18Z) - Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning [59.44422468242455]
We propose a novel method dubbed ShrinkMatch to learn from uncertain samples.
For each uncertain sample, it adaptively seeks a shrunk class space, which merely contains the original top-1 class.
We then impose a consistency regularization between a pair of strongly and weakly augmented samples in the shrunk space to strive for discriminative representations.
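A rough sketch of such a consistency term follows; the fixed top-k rule stands in for the paper's adaptive shrinking, and `k` is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def shrunk_consistency(weak_logits: torch.Tensor, strong_logits: torch.Tensor,
                       k: int = 5) -> torch.Tensor:
    """Sketch of consistency regularization in a shrunk class space.

    For each uncertain sample, keep only the top-k classes of the weak
    view (a fixed stand-in for the adaptively shrunk space, which by
    construction contains the original top-1 class) and align the
    strong view with the weak view on that subset.
    """
    idx = weak_logits.topk(k, dim=1).indices   # per-sample shrunk space
    w = torch.gather(weak_logits, 1, idx)
    s = torch.gather(strong_logits, 1, idx)
    target = F.softmax(w, dim=1).detach()      # weak view as soft target
    return F.cross_entropy(s, target)          # soft-target cross-entropy
```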
arXiv Detail & Related papers (2023-08-13T14:05:24Z) - When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
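The deferral rule itself is simple to state in code; below is a minimal sketch of confidence-based deferral for a single input, with an illustrative threshold.

```python
import torch
import torch.nn.functional as F

def cascade_predict(x: torch.Tensor, models: list, threshold: float = 0.9) -> torch.Tensor:
    """Confidence-based cascade deferral for one sample: each classifier
    answers if its top softmax probability clears the threshold and
    defers to the next classifier otherwise."""
    with torch.no_grad():
        for model in models[:-1]:
            probs = F.softmax(model(x), dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() >= threshold:       # confident enough: stop early
                return pred
        return models[-1](x).argmax(dim=-1)    # last model always answers
```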
arXiv Detail & Related papers (2023-07-06T04:13:57Z) - The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks [64.12052498909105]
We study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks.
In two-layer ReLU networks, gradient flow is biased towards solutions that generalize well but are highly vulnerable to adversarial examples.
arXiv Detail & Related papers (2023-03-02T18:14:35Z) - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z) - Taming Overconfident Prediction on Unlabeled Data from Hindsight [50.9088560433925]
Minimizing prediction uncertainty on unlabeled data is a key factor to achieve good performance in semi-supervised learning.
This paper proposes a dual mechanism, named ADaptive Sharpening (ADS), which first applies a soft-threshold to adaptively mask out determinate and negligible predictions.
ADS significantly improves state-of-the-art SSL methods when used as a plug-in.
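Very roughly, the two-step mechanism could look like the sketch below; the confidence band, thresholds, and masking rule are guesses for illustration and not the ADS definition from the paper.

```python
import torch

def adaptive_sharpening(probs: torch.Tensor, low: float = 0.4,
                        high: float = 0.95, temperature: float = 0.5) -> torch.Tensor:
    """Rough per-sample sketch: leave already-determinate predictions
    (max prob >= high) untouched, skip near-uninformative ones
    (max prob < low), and temperature-sharpen the informed rest."""
    conf = probs.max(dim=1).values
    informed = (conf >= low) & (conf < high)   # the "informed" samples
    sharp = probs.pow(1.0 / temperature)       # temperature sharpening
    sharp = sharp / sharp.sum(dim=1, keepdim=True)
    out = probs.clone()
    out[informed] = sharp[informed]            # sharpen only informed rows
    return out
```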
arXiv Detail & Related papers (2021-12-15T15:17:02Z) - The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration [21.63888208442176]
In spite of the dominant performances of deep neural networks, recent works have shown that they are poorly calibrated.
We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses.
We propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances.
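Such a margin can be written as a hinge-style penalty on logit distances; the sketch below follows that commonly cited form, with illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def margin_label_smoothing_loss(logits: torch.Tensor, targets: torch.Tensor,
                                margin: float = 10.0, lam: float = 0.1) -> torch.Tensor:
    """Sketch of a margin penalty on logit distances: distances from the
    top logit are allowed up to `margin`, and only the excess is
    penalized, giving a controllable alternative to uniform smoothing."""
    ce = F.cross_entropy(logits, targets)
    dist = logits.max(dim=1, keepdim=True).values - logits   # non-negative distances
    penalty = F.relu(dist - margin).sum(dim=1).mean()        # only violations count
    return ce + lam * penalty
```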
arXiv Detail & Related papers (2021-11-30T14:21:47Z) - Second-Moment Loss: A Novel Regression Objective for Improved Uncertainties [7.766663822644739]
Quantification of uncertainty is one of the most promising approaches to establish safe machine learning.
One of the most commonly used approaches so far is Monte Carlo dropout, which is computationally cheap and easy to apply in practice.
We propose a new objective, referred to as the second-moment loss, to improve the quality of these uncertainty estimates.
arXiv Detail & Related papers (2020-12-23T14:17:33Z) - Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks [65.24701908364383]
We show that a sufficient condition for calibrated uncertainty on a ReLU network is "to be a bit Bayesian".
We further validate these findings empirically via various standard experiments using common deep ReLU networks and Laplace approximations.
arXiv Detail & Related papers (2020-02-24T08:52:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all content) and is not responsible for any consequences arising from its use.