MaxSup: Overcoming Representation Collapse in Label Smoothing
- URL: http://arxiv.org/abs/2502.15798v1
- Date: Tue, 18 Feb 2025 20:10:34 GMT
- Title: MaxSup: Overcoming Representation Collapse in Label Smoothing
- Authors: Yuxuan Zhou, Heng Li, Zhi-Qi Cheng, Xudong Yan, Mario Fritz, Margret Keuper
- Abstract summary: Label Smoothing (LS) is widely adopted to curb overconfidence in neural network predictions and enhance generalization. Previous research shows that LS can force feature representations into excessively tight clusters, eroding intra-class distinctions. We propose Max Suppression (MaxSup), which uniformly applies the intended regularization to both correct and incorrect predictions.
- Score: 55.067663157622384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Label Smoothing (LS) is widely adopted to curb overconfidence in neural network predictions and enhance generalization. However, previous research shows that LS can force feature representations into excessively tight clusters, eroding intra-class distinctions. More recent findings suggest that LS also induces overconfidence in misclassifications, yet the precise mechanism remained unclear. In this work, we decompose the loss term introduced by LS, revealing two key components: (i) a regularization term that functions only when the prediction is correct, and (ii) an error-enhancement term that emerges under misclassifications. This latter term compels the model to reinforce incorrect predictions with exaggerated certainty, further collapsing the feature space. To address these issues, we propose Max Suppression (MaxSup), which uniformly applies the intended regularization to both correct and incorrect predictions by penalizing the top-1 logit instead of the ground-truth logit. Through feature analyses, we show that MaxSup restores intra-class variation and sharpens inter-class boundaries. Extensive experiments on image classification and downstream tasks confirm that MaxSup is a more robust alternative to LS. Code is available at: https://github.com/ZhouYuxuanYX/Maximum-Suppression-Regularization.
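The abstract pins down the core substitution: Label Smoothing's extra loss term involves the ground-truth logit, while MaxSup penalizes the top-1 logit. Below is a minimal PyTorch sketch of that substitution, assuming `alpha` plays the role of the smoothing coefficient; it illustrates the abstract's description and is not the authors' released implementation (see the linked repository for the official code):

```python
import torch
import torch.nn.functional as F

def maxsup_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(logits, targets)
    # Label Smoothing's extra term is roughly alpha * (z_gt - mean(z)), which
    # turns error-enhancing under misclassification; MaxSup penalizes the
    # top-1 logit instead, so the penalty acts uniformly on correct and
    # incorrect predictions.
    top1 = logits.max(dim=1).values
    penalty = (top1 - logits.mean(dim=1)).mean()
    return ce + alpha * penalty
```

Replacing `top1` with the ground-truth logit (`logits.gather(1, targets.unsqueeze(1))`) recovers the Label Smoothing term whose error-enhancement behaviour the decomposition exposes.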
Related papers
- Decision from Suboptimal Classifiers: Excess Risk Pre- and Post-Calibration [52.70324949884702]
We quantify the excess risk incurred using approximate posterior probabilities in batch binary decision-making.
We identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss.
In NLP experiments, we show that these quantities identify when the expected gain from more advanced post-training is worth the operational cost.
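For context, a minimal sketch of the kind of post-hoc recalibration the paper analyzes, here using scikit-learn's isotonic regression on synthetic held-out scores (the data and setup are illustrative, not the paper's experiments):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Synthetic held-out scores and binary labels, for illustration only.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = np.clip(0.35 + 0.3 * labels + 0.2 * rng.standard_normal(1000), 0.0, 1.0)

# Fit a monotone map from raw scores to calibrated probabilities.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(scores, labels)
calibrated = calibrator.predict(scores)
```

Because such a map depends only on the score, it can reduce calibration error but cannot touch the grouping loss, which is why the decomposition distinguishes the two regimes.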
arXiv Detail & Related papers (2025-03-23T10:52:36Z)
- Improved Feature Generating Framework for Transductive Zero-shot Learning [31.656888766677664]
Feature Generative Adversarial Networks have emerged as powerful generative models for producing high-quality representations of unseen classes.
This paper examines the pivotal influence of unseen-class priors within the framework of Transductive Zero-shot Learning (TZSL).
We introduce our Improved Feature Generation Framework, termed I-VAEGAN, which incorporates two novel components: Pseudo-conditional Feature Adversarial (PFA) learning and Variational Embedding Regression (VER).
arXiv Detail & Related papers (2024-12-24T08:42:16Z)
- Predicting Emergent Capabilities by Finetuning [98.9684114851891]
We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models.
We validate this approach using four standard NLP benchmarks.
We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
arXiv Detail & Related papers (2024-11-25T01:48:09Z)
- Towards Understanding Why Label Smoothing Degrades Selective Classification and How to Fix It [6.19039575840278]
Label smoothing (LS) is a popular regularisation method for training neural networks.
LS degrades the uncertainty rank ordering of correct vs. incorrect predictions.
We provide an explanation for this behaviour by analysing logit-level gradients.
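A generic way to quantify that rank ordering (not the authors' code): score each prediction by its max-softmax confidence and measure, e.g. with AUROC, how well confidence separates correct from incorrect predictions:

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

def confidence_rank_auroc(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # Max-softmax confidence and predicted class per sample.
    probs = F.softmax(logits, dim=1)
    conf, pred = probs.max(dim=1)
    # AUROC of confidence as a score for "prediction is correct":
    # the rank ordering that label smoothing reportedly degrades.
    correct = (pred == targets).long()
    return float(roc_auc_score(correct.numpy(), conf.detach().numpy()))
```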
arXiv Detail & Related papers (2024-03-19T06:46:24Z)
- Classification under Nuisance Parameters and Generalized Label Shift in Likelihood-Free Inference [3.507509142413452]
We propose a new method for robust uncertainty quantification that casts classification as a hypothesis testing problem under nuisance parameters.
Our method effectively endows a pre-trained classifier with domain adaptation capabilities and returns valid prediction sets while maintaining high power.
We demonstrate its performance on two challenging scientific problems in biology and astroparticle physics with data from realistic mechanistic models.
arXiv Detail & Related papers (2024-02-08T00:12:18Z)
- When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice.
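The rule itself is simple; a sketch for a single input, assuming `models` is ordered from cheapest to most expensive and `tau` is an illustrative confidence threshold:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cascade_predict(models, x: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
    # Try cheaper models first; defer to the next one whenever the
    # max-softmax confidence falls below the threshold.
    for model in models[:-1]:
        probs = F.softmax(model(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= tau:  # confident enough: terminate here
            return pred
    # The last model in the cascade always answers.
    return models[-1](x).argmax(dim=-1)
```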
arXiv Detail & Related papers (2023-07-06T04:13:57Z)
- Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z)
- Taming Overconfident Prediction on Unlabeled Data from Hindsight [50.9088560433925]
Minimizing prediction uncertainty on unlabeled data is a key factor in achieving good performance in semi-supervised learning.
This paper proposes a dual mechanism, named ADaptive Sharpening (ADS), which first applies a soft threshold to adaptively mask out determinate and negligible predictions, then sharpens the informative predictions that remain.
As a plug-in, ADS significantly improves state-of-the-art SSL methods.
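The summary leaves the soft-threshold unspecified; a rough sketch of the two-step shape of such a mechanism (mask, then sharpen), where every threshold and the temperature are illustrative assumptions rather than the paper's settings:

```python
import torch

def mask_then_sharpen(probs: torch.Tensor, tau: float = 0.95,
                      eps: float = 0.05, T: float = 0.5) -> torch.Tensor:
    # Step 1: leave already-determinate samples (confidence >= tau) as-is
    # and zero out negligible per-class probabilities (below eps).
    conf = probs.max(dim=1).values
    informative = conf < tau
    masked = torch.where(probs < eps, torch.zeros_like(probs), probs)
    # Step 2: sharpen the remaining distributions by temperature scaling.
    sharpened = masked.pow(1.0 / T)
    sharpened = sharpened / sharpened.sum(dim=1, keepdim=True).clamp_min(1e-12)
    return torch.where(informative.unsqueeze(1), sharpened, probs)
```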
arXiv Detail & Related papers (2021-12-15T15:17:02Z)
- The Devil is in the Margin: Margin-based Label Smoothing for Network Calibration [21.63888208442176]
Despite the dominant performance of deep neural networks, recent works have shown that they are poorly calibrated.
We provide a unifying constrained-optimization perspective of current state-of-the-art calibration losses.
We propose a simple and flexible generalization based on inequality constraints, which imposes a controllable margin on logit distances.
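A compact sketch of a penalty in that spirit, turning the margin on logit distances into a hinge term added to cross-entropy (constants illustrative; see the paper for the exact constrained formulation):

```python
import torch
import torch.nn.functional as F

def margin_penalized_ce(logits: torch.Tensor, targets: torch.Tensor,
                        margin: float = 10.0, lam: float = 0.1) -> torch.Tensor:
    ce = F.cross_entropy(logits, targets)
    # Distance of every logit to the maximum logit; only distances exceeding
    # the margin are penalized, an inequality constraint rather than the
    # equality that uniform label smoothing implicitly imposes.
    dist = logits.max(dim=1, keepdim=True).values - logits
    penalty = F.relu(dist - margin).sum(dim=1).mean()
    return ce + lam * penalty
```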
arXiv Detail & Related papers (2021-11-30T14:21:47Z)
- Second-Moment Loss: A Novel Regression Objective for Improved Uncertainties [7.766663822644739]
Quantification of uncertainty is one of the most promising approaches to establish safe machine learning.
One of the most commonly used approaches so far is Monte Carlo dropout, which is computationally cheap and easy to apply in practice.
We propose a new objective, referred to as the second-moment loss, to improve the quality of these uncertainty estimates.
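A heavily simplified sketch of a second-moment-style objective: alongside the usual mean fit, an auxiliary head is regressed onto the squared residual, i.e. the second moment of the error (a generic rendering of the idea, not necessarily the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def second_moment_style_loss(mean_pred: torch.Tensor, sq_pred: torch.Tensor,
                             y: torch.Tensor) -> torch.Tensor:
    # Fit the mean as usual; detach the residual target so the auxiliary
    # head learns the error's second moment without distorting the mean fit.
    residual_sq = (y - mean_pred).detach().pow(2)
    return F.mse_loss(mean_pred, y) + F.mse_loss(sq_pred, residual_sq)
```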
arXiv Detail & Related papers (2020-12-23T14:17:33Z)