On the Role of Label Noise in the Feature Learning Process
- URL: http://arxiv.org/abs/2505.18909v1
- Date: Sun, 25 May 2025 00:13:28 GMT
- Title: On the Role of Label Noise in the Feature Learning Process
- Authors: Andi Han, Wei Huang, Zhanpeng Zhou, Gang Niu, Wuyang Chen, Junchi Yan, Akiko Takeda, Taiji Suzuki,
- Abstract summary: We consider a signal-noise data distribution, where each sample comprises a label-dependent signal and label-independent noise.<n>Our analysis identifies two key stages. In Stage I, the model perfectly fits all the clean samples while ignoring the noisy ones.<n>In Stage II, the gradient in the direction of noise surpasses that of the signal, leading to overfitting on noisy samples.
- Score: 90.49232384723268
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning with noisy labels presents significant challenges. In this work, we theoretically characterize the role of label noise from a feature learning perspective. Specifically, we consider a signal-noise data distribution, where each sample comprises a label-dependent signal and label-independent noise, and rigorously analyze the training dynamics of a two-layer convolutional neural network under this data setup, along with the presence of label noise. Our analysis identifies two key stages. In Stage I, the model perfectly fits all the clean samples (i.e., samples without label noise) while ignoring the noisy ones (i.e., samples with noisy labels). During this stage, the model learns the signal from the clean samples, which generalizes well on unseen data. In Stage II, as the training loss converges, the gradient in the direction of noise surpasses that of the signal, leading to overfitting on noisy samples. Eventually, the model memorizes the noise present in the noisy samples and degrades its generalization ability. Furthermore, our analysis provides a theoretical basis for two widely used techniques for tackling label noise: early stopping and sample selection. Experiments on both synthetic and real-world setups validate our theory.
Related papers
- Classifying Long-tailed and Label-noise Data via Disentangling and Unlearning [58.052712054684946]
In real-world datasets, the challenges of long-tailed distributions and noisy labels often coexist.<n>We propose a novel method called Disentangling and Unlearning for Long-tailed and Label-noisy data.
arXiv Detail & Related papers (2025-03-14T13:58:27Z) - RepFace: Refining Closed-Set Noise with Progressive Label Correction for Face Recognition [7.436952568257183]
Face recognition performance is heavily affected by the label noise, especially closed-set noise.<n>We propose a new framework to stabilize the training at early stages and split the samples into clean, ambiguous and noisy groups.<n>Our method achieves state-of-the-art results on mainstream face datasets.
arXiv Detail & Related papers (2024-12-16T17:57:33Z) - Mitigating Noisy Supervision Using Synthetic Samples with Soft Labels [13.314778587751588]
Noisy labels are ubiquitous in real-world datasets, especially in the large-scale ones derived from crowdsourcing and web searching.
It is challenging to train deep neural networks with noisy datasets since the networks are prone to overfitting the noisy labels during training.
We propose a framework that trains the model with new synthetic samples to mitigate the impact of noisy labels.
arXiv Detail & Related papers (2024-06-22T04:49:39Z) - Rethinking Noisy Label Learning in Real-world Annotation Scenarios from
the Noise-type Perspective [38.24239397999152]
We propose a novel sample selection-based approach for noisy label learning, called Proto-semi.
Proto-semi divides all samples into the confident and unconfident datasets via warm-up.
By leveraging the confident dataset, prototype vectors are constructed to capture class characteristics.
Empirical evaluations on a real-world annotated dataset substantiate the robustness of Proto-semi in handling the problem of learning from noisy labels.
arXiv Detail & Related papers (2023-07-28T10:57:38Z) - Instance-dependent Noisy-label Learning with Graphical Model Based Noise-rate Estimation [16.283722126438125]
Label Noise Learning (LNL) incorporates a sample selection stage to differentiate clean and noisy-label samples.
Such curriculum is sub-optimal since it does not consider the actual label noise rate in the training set.
This paper addresses this issue with a new noise-rate estimation method that is easily integrated with most state-of-the-art (SOTA) LNL methods.
arXiv Detail & Related papers (2023-05-31T01:46:14Z) - Doubly Stochastic Models: Learning with Unbiased Label Noises and
Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of gradient descent.
We find such implicit regularizer would favor some convergence points that could stabilize model outputs against perturbation of parameters.
Our work doesn't assume SGD as an Ornstein-Uhlenbeck like process and achieve a more general result with convergence of approximation proved.
arXiv Detail & Related papers (2023-04-01T14:09:07Z) - Learning from Noisy Labels with Coarse-to-Fine Sample Credibility
Modeling [22.62790706276081]
Training deep neural network (DNN) with noisy labels is practically challenging.
Previous efforts tend to handle part or full data in a unified denoising flow.
We propose a coarse-to-fine robust learning method called CREMA to handle noisy data in a divide-and-conquer manner.
arXiv Detail & Related papers (2022-08-23T02:06:38Z) - Label-Noise Learning with Intrinsically Long-Tailed Data [65.41318436799993]
We propose a learning framework for label-noise learning with intrinsically long-tailed data.
Specifically, we propose two-stage bi-dimensional sample selection (TABASCO) to better separate clean samples from noisy samples.
arXiv Detail & Related papers (2022-08-21T07:47:05Z) - S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise.
In the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of the labels in its neighborhood in the feature space.
Our method significantly surpasses previous methods on both CIFARCIFAR100 with artificial noise and real-world noisy datasets such as WebVision and ANIMAL-10N.
arXiv Detail & Related papers (2021-11-22T15:49:20Z) - Training Classifiers that are Universally Robust to All Label Noise
Levels [91.13870793906968]
Deep neural networks are prone to overfitting in the presence of label noise.
We propose a distillation-based framework that incorporates a new subcategory of Positive-Unlabeled learning.
Our framework generally outperforms at medium to high noise levels.
arXiv Detail & Related papers (2021-05-27T13:49:31Z) - Analysing the Noise Model Error for Realistic Noisy Label Data [14.766574408868806]
We study the quality of estimated noise models from the theoretical side by deriving the expected error of the noise model.
We also publish NoisyNER, a new noisy label dataset from the NLP domain.
arXiv Detail & Related papers (2021-01-24T17:45:15Z) - A Second-Order Approach to Learning with Instance-Dependent Label Noise [58.555527517928596]
The presence of label noise often misleads the training of deep neural networks.
We show that the errors in human-annotated labels are more likely to be dependent on the difficulty levels of tasks.
arXiv Detail & Related papers (2020-12-22T06:36:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.