Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels
- URL: http://arxiv.org/abs/2301.00545v4
- Date: Wed, 29 Nov 2023 10:10:04 GMT
- Title: Knockoffs-SPR: Clean Sample Selection in Learning with Noisy Labels
- Authors: Yikai Wang, Yanwei Fu, and Xinwei Sun
- Abstract summary: We propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels.
Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline.
We further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data.
- Score: 56.81761908354718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A noisy training set usually degrades the generalization and robustness of neural networks. In this paper, we propose a novel theoretically guaranteed clean sample selection framework for learning with noisy labels. Specifically, we first present a Scalable Penalized Regression (SPR) method to model the linear relation between network features and one-hot labels. In SPR, the clean data are identified as the samples whose mean-shift parameters are estimated to be zero in the regression model. We theoretically show that SPR can recover the clean data under certain conditions. In general scenarios, however, these conditions may no longer hold, and some noisy data are falsely selected as clean. To solve this problem, we propose a data-adaptive method, Scalable Penalized Regression with Knockoff filters (Knockoffs-SPR), which provably controls the False-Selection-Rate (FSR) of the selected clean data. To improve efficiency, we further present a split algorithm that divides the whole training set into small pieces that can be solved in parallel, making the framework scalable to large datasets. While Knockoffs-SPR can be regarded as a sample selection module for a standard supervised training pipeline, we further combine it with a semi-supervised algorithm to exploit the support of noisy data as unlabeled data. Experimental results on several benchmark datasets and real-world noisy datasets show the effectiveness of our framework and validate the theoretical results of Knockoffs-SPR. Our code and pre-trained models are available at https://github.com/Yikai-Wang/Knockoffs-SPR.
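The SPR step above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): it assumes the mean-shift model Y ≈ Xβ + γ is solved by alternating a ridge update for β with row-wise group soft-thresholding of γ, and the hyper-parameters `lam` and `tau` are illustrative placeholders.

```python
import numpy as np

def spr_clean_selection(X, Y, lam=1e-2, tau=0.5, n_iters=50):
    """Sketch of SPR-style clean-sample selection.

    Model: Y ~ X @ beta + gamma, where each row gamma_i is a per-sample
    mean-shift parameter. Rows driven exactly to zero by the group-sparse
    penalty are flagged as clean.

    X: (n, d) network features; Y: (n, c) one-hot labels.
    lam (ridge strength) and tau (threshold) are illustrative values,
    not taken from the paper. Returns a boolean clean mask of shape (n,).
    """
    n, d = X.shape
    gamma = np.zeros_like(Y, dtype=float)
    # The ridge system matrix is formed once; it is re-solved each iteration.
    G = X.T @ X + lam * np.eye(d)
    for _ in range(n_iters):
        beta = np.linalg.solve(G, X.T @ (Y - gamma))  # ridge step for beta
        R = Y - X @ beta                              # per-sample residuals
        norms = np.linalg.norm(R, axis=1, keepdims=True)
        # Group soft-thresholding: rows with small residual norm get gamma_i = 0.
        gamma = R * np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return np.linalg.norm(gamma, axis=1) == 0  # True = selected as clean
```

A single fixed threshold like `tau` gives no error control; Knockoffs-SPR replaces it with a data-adaptive, knockoff-style comparison that provably bounds the FSR, and the split algorithm applies the selection to small subsets of the training data in parallel. Both parts are omitted from this sketch.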
Related papers
- Granular-ball Representation Learning for Deep CNN on Learning with Label Noise [14.082510085545582]
We propose a general granular-ball computing (GBC) module that can be embedded into a CNN model.
In this study, we split the input samples into $gb$ samples at the feature level, each of which can correspond to multiple samples with varying numbers and shares a single label.
Experiments demonstrate that the proposed method can improve the robustness of CNN models with no additional data or optimization.
arXiv Detail & Related papers (2024-09-05T05:18:31Z)
- Foster Adaptivity and Balance in Learning with Noisy Labels [26.309508654960354]
We propose a novel approach named SED to deal with label noise in a Self-adaptivE and class-balanceD manner.
A mean-teacher model is then employed to correct labels of noisy samples.
We additionally propose a self-adaptive and class-balanced sample re-weighting mechanism to assign different weights to detected noisy samples.
arXiv Detail & Related papers (2024-07-03T03:10:24Z)
- Combating Label Noise With A General Surrogate Model For Sample Selection [84.61367781175984]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically.
We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
arXiv Detail & Related papers (2023-10-16T14:43:27Z)
- Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels [61.97359362447732]
Learning from noisy labels is an important and long-standing problem in machine learning for real applications.
In this paper, we reformulate the label-noise problem from a generative-model perspective.
Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets.
arXiv Detail & Related papers (2023-05-31T03:01:36Z)
- Class Prototype-based Cleaner for Label Noise Learning [73.007001454085]
Semi-supervised learning methods are current SOTA solutions to the noisy-label learning problem.
We propose a simple yet effective solution, named Class Prototype-based label noise Cleaner.
arXiv Detail & Related papers (2022-12-21T04:56:41Z)
- UNICON: Combating Label Noise Through Uniform Selection and Contrastive Learning [89.56465237941013]
We propose UNICON, a simple yet effective sample selection method which is robust to high label noise.
We obtain an 11.4% improvement over the current state-of-the-art on CIFAR100 dataset with a 90% noise rate.
arXiv Detail & Related papers (2022-03-28T07:36:36Z)
- Scalable Penalized Regression for Noise Detection in Learning with Noisy Labels [44.79124350922491]
We propose a theoretically guaranteed noisy-label detection framework to detect and remove noisy data for Learning with Noisy Labels (LNL).
Specifically, we design a penalized regression to model the linear relation between network features and one-hot labels.
To make the framework scalable to datasets that contain a large number of categories and training data, we propose a split algorithm to divide the whole training set into small pieces.
arXiv Detail & Related papers (2022-03-15T11:09:58Z)
- Robust Training under Label Noise by Over-parameterization [41.03008228953627]
We propose a principled approach for robust training of over-parameterized deep networks in classification tasks where a proportion of training labels are corrupted.
The main idea is very simple: label noise is sparse and incoherent with the network learned from clean data, so we model the noise and learn to separate it from the data (a toy sketch follows this list).
Remarkably, when trained using such a simple method in practice, we demonstrate state-of-the-art test accuracy against label noise on a variety of real datasets.
arXiv Detail & Related papers (2022-02-28T18:50:10Z)
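The noise-separation idea in the last entry can be shown with a toy sketch: an over-parameterized per-sample term absorbs the sparse label noise while the shared weights fit the clean structure. This is only an illustration under simplified assumptions; a linear model and an explicit l1 penalty stand in for the paper's actual parameterization and its implicit regularization.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, c = 512, 32, 10
X = torch.randn(n, d)
y = torch.randint(0, c, (n,))
y[:100] = torch.randint(0, c, (100,))      # corrupt the first 100 labels

W = torch.zeros(d, c, requires_grad=True)  # shared linear "network"
S = torch.zeros(n, c, requires_grad=True)  # over-parameterized per-sample noise term
opt = torch.optim.SGD([W, S], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    logits = X @ W + S                     # S can absorb what W cannot fit
    # l1 keeps S sparse, so only genuinely mislabeled samples pay for a large S_i.
    loss = F.cross_entropy(logits, y) + 1e-3 * S.abs().sum()
    loss.backward()
    opt.step()

noisy_score = S.detach().norm(dim=1)       # large norm -> likely noisy label
```

Samples whose rows of S stay near zero are the ones the shared model explains, mirroring the claim that sparse, incoherent noise can be modeled and separated from the clean data.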
This list is automatically generated from the titles and abstracts of the papers on this site.