Unbiased Risk Estimators Can Mislead: A Case Study of Learning with
Complementary Labels
- URL: http://arxiv.org/abs/2007.02235v3
- Date: Fri, 21 Aug 2020 18:11:55 GMT
- Title: Unbiased Risk Estimators Can Mislead: A Case Study of Learning with
Complementary Labels
- Authors: Yu-Ting Chou, Gang Niu, Hsuan-Tien Lin, Masashi Sugiyama
- Abstract summary: We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more than the unbiasedness of the risk estimator in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
- Score: 92.98756432746482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In weakly supervised learning, the unbiased risk estimator (URE) is a
powerful tool for training classifiers when training and test data are drawn from
different distributions. Nevertheless, UREs lead to overfitting in many problem
settings when the models are complex, such as deep networks. In this paper, we
investigate the reasons for such overfitting by studying a weakly supervised
problem called learning with complementary labels. We argue that the quality of
gradient estimation matters more than the unbiasedness of the risk estimator in
risk minimization. Theoretically, we show that a URE gives an unbiased gradient
estimator (UGE). Practically, however, UGEs may suffer from huge variance, which
causes empirical gradients to often be far away from the true gradients during
minimization. To address this, we propose a novel surrogate complementary loss
(SCL) framework that trades zero bias for reduced variance and makes the
direction of empirical gradients better aligned with that of the true gradients.
Thanks to this characteristic, SCL successfully mitigates the overfitting issue
and improves URE-based methods.
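For concreteness, the sketch below (illustrative PyTorch, not the authors' released code) shows the classical unbiased risk estimator for uniform complementary labels used in prior work on this setting, together with a simple bounded surrogate in the spirit of SCL. The use of cross-entropy as the base loss, the clamping constant, and all function names are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def ure_complementary_loss(logits, comp_labels, num_classes):
    """Classical URE for uniform complementary labels (a sketch):
    R(f) = E_x[ sum_k ell(f(x), k) ] - (K - 1) * E_{(x, ybar)}[ ell(f(x), ybar) ],
    here with ell = cross-entropy. Unbiased in expectation, but the large
    negative term lets mini-batch losses and gradients swing far from their
    population values, which is the variance issue the paper analyzes."""
    log_probs = F.log_softmax(logits, dim=1)
    loss_all = -log_probs.sum(dim=1)                              # sum_k ell(f(x), k)
    loss_bar = -log_probs.gather(1, comp_labels.unsqueeze(1)).squeeze(1)
    return (loss_all - (num_classes - 1) * loss_bar).mean()

def biased_surrogate_loss(logits, comp_labels):
    """A simple *biased* surrogate in the spirit of SCL (illustrative only,
    not necessarily the paper's exact loss): push down the predicted
    probability of the complementary class. Each term is bounded, so
    mini-batch gradients fluctuate far less than the URE's."""
    probs = F.softmax(logits, dim=1)
    p_bar = probs.gather(1, comp_labels.unsqueeze(1)).squeeze(1)
    return -torch.log1p(-p_bar.clamp(max=1.0 - 1e-6)).mean()
```

The first loss is unbiased but unbounded below per example; the second is biased but bounded, which is the bias-for-variance trade the abstract describes.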
Related papers
- Typicalness-Aware Learning for Failure Detection [26.23185979968123]
Deep neural networks (DNNs) often suffer from the overconfidence issue, where incorrect predictions are made with high confidence scores.
We propose a novel approach called Typicalness-Aware Learning (TAL) to address this issue and improve failure detection performance.
arXiv Detail & Related papers (2024-11-04T11:09:47Z)
- Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z)
- Model-Aware Contrastive Learning: Towards Escaping the Dilemmas [11.27589489269041]
Contrastive learning (CL) continuously achieves significant breakthroughs across multiple domains.
InfoNCE-based methods suffer from some dilemmas, such as the uniformity-tolerance dilemma (UTD) and gradient reduction.
We present a Model-Aware Contrastive Learning (MACL) strategy, whose temperature is adaptive to the magnitude of alignment that reflects the basic confidence of the instance discrimination task.
arXiv Detail & Related papers (2022-07-16T08:21:55Z)
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to non-linear settings via deep learning with bias constraints.
A second motivation for BCE (bias-constrained estimation) is in applications where multiple estimates of the same unknown are averaged for improved performance, as the identity shown after this entry illustrates.
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
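To make the averaging motivation concrete, here is the textbook bias-variance identity (a standard fact, not taken from that paper's analysis):

```latex
% Average of m independent estimates, each with bias b and variance \sigma^2:
\bar{\theta} = \frac{1}{m}\sum_{i=1}^{m}\hat{\theta}_i,
\qquad
\operatorname{MSE}(\bar{\theta}) = b^{2} + \frac{\sigma^{2}}{m}
```

Averaging shrinks the variance term as 1/m but leaves the bias term untouched, which is why estimators trained to have (near) zero bias are the ones that benefit from being averaged.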
- Gradient Imitation Reinforcement Learning for Low Resource Relation Extraction [52.63803634033647]
Low-resource relation extraction (LRE) aims to extract relation facts from limited labeled corpora when human annotation is scarce.
We develop a Gradient Imitation Reinforcement Learning method to encourage pseudo label data to imitate the gradient descent direction on labeled data.
We also propose a framework called GradLRE, which handles two major scenarios in low-resource relation extraction.
arXiv Detail & Related papers (2021-09-14T03:51:15Z)
- On the Minimal Error of Empirical Risk Minimization [90.09093901700754]
We study the minimal error of the Empirical Risk Minimization (ERM) procedure in the task of regression.
Our sharp lower bounds shed light on the possibility (or impossibility) of adapting to simplicity of the model generating the data.
arXiv Detail & Related papers (2021-02-24T04:47:55Z)
- On the Origin of Implicit Regularization in Stochastic Gradient Descent [22.802683068658897]
For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function.
We prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite.
We verify that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.
arXiv Detail & Related papers (2021-01-28T18:32:14Z)
- Learning with Gradient Descent and Weakly Convex Losses [14.145079120746614]
We study the learning performance of gradient descent when the empirical risk is weakly convex.
In the case of a two layer neural network, we demonstrate that the empirical risk can satisfy a notion of local weak convexity.
arXiv Detail & Related papers (2021-01-13T09:58:06Z)
- On the Convergence of SGD with Biased Gradients [28.400751656818215]
We analyze the convergence of biased stochastic gradient methods (SGD), where individual updates are corrupted by compression.
We quantify how the magnitude of the bias impacts the attainable accuracy and the convergence rates.
arXiv Detail & Related papers (2020-07-31T19:37:59Z)
- Understanding Gradient Clipping in Private SGD: A Geometric Perspective [68.61254575987013]
Deep learning models are increasingly popular in many machine learning applications where the training data may contain sensitive information.
Many learning systems now incorporate differential privacy by training their models with (differentially) private SGD.
A key step in each private SGD update is gradient clipping, which shrinks the gradient of an individual example whenever its L2 norm exceeds some threshold (a minimal sketch of this step follows the list).
arXiv Detail & Related papers (2020-06-27T19:08:12Z)
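For the last related paper above, here is a minimal sketch of the per-example clipping step it describes (illustrative PyTorch, not any particular DP library's API; the clipping threshold, noise scale, learning rate, and function name are assumptions for the example):

```python
import torch

def clipped_private_sgd_step(model, loss_fn, xs, ys,
                             clip_norm=1.0, noise_mult=1.0, lr=0.1):
    """Per-example gradient clipping (a sketch): each example's gradient is
    rescaled so its L2 norm is at most clip_norm, the clipped gradients are
    summed, Gaussian noise is added, and the averaged result is applied."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xs, ys):                          # one example at a time
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
        clip_coef = min(1.0, clip_norm / (total_norm + 1e-12))   # shrink if norm > C
        for s, g in zip(summed, grads):
            s.add_(g, alpha=clip_coef)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = noise_mult * clip_norm * torch.randn_like(s)
            p.add_(-(lr / len(xs)) * (s + noise))
```

Because clipping rescales only the large per-example gradients, the averaged update is in general a biased estimate of the true gradient, which is the effect that paper analyzes from a geometric perspective.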