Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise Research
- URL: http://arxiv.org/abs/2309.04318v2
- Date: Mon, 23 Sep 2024 13:15:52 GMT
- Title: Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise Research
- Authors: Sjoerd de Vries, Dirk Thierens
- Abstract summary: We introduce SYNLABEL, a framework designed to create noiseless datasets informed by real-world data.
We demonstrate its ability to precisely quantify label noise and its improvement over existing methodologies.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many real-world classification tasks, label noise is an unavoidable issue that adversely affects the generalization error of machine learning models. Additionally, evaluating how methods handle such noise is complicated, as the effect label noise has on their performance cannot be accurately quantified without clean labels. Existing research on label noise typically relies on either noisy or oversimplified simulated data as a baseline, into which additional noise with known properties is injected. In this paper, we introduce SYNLABEL, a framework designed to address these limitations by creating noiseless datasets informed by real-world data. SYNLABEL supports defining a pre-specified or learned function as the ground truth function, which can then be used for generating new clean labels. Furthermore, by repeatedly resampling values for selected features within the domain of the function, evaluating the function and aggregating the resulting labels, each data point can be assigned a soft label or label distribution. These distributions capture the inherent uncertainty present in many real-world datasets and enable the direct injection and quantification of label noise. The generated datasets serve as a clean baseline of adjustable complexity, into which various types of noise can be introduced. Additionally, they facilitate research into soft label learning and related applications. We demonstrate the application of SYNLABEL, showcasing its ability to precisely quantify label noise and its improvement over existing methodologies.
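The resampling procedure described in the abstract can be illustrated with a minimal sketch. This is not SYNLABEL's actual API, which the abstract does not show. The function names (`ground_truth`, `soft_labels`, `inject_uniform_noise`, `noise_level`), the thresholded linear ground truth function, the choice to resample from the feature's empirical marginal, and the use of total variation distance as the noise measure are all assumptions made for illustration.

```python
import numpy as np


def ground_truth(X):
    """Hypothetical deterministic ground truth function (an assumption):
    thresholds a linear score into three classes 0, 1, 2."""
    score = X[:, 0] + 0.5 * X[:, 1] - X[:, 2]
    return np.digitize(score, bins=[-0.5, 0.5])


def soft_labels(X, hidden_idx, n_classes, n_samples=200, rng=None):
    """Assign each row a label distribution by repeatedly resampling one
    feature, evaluating the ground truth function, and aggregating the
    resulting hard labels into relative frequencies."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    rows = np.arange(n)
    counts = np.zeros((n, n_classes))
    for _ in range(n_samples):
        X_res = X.copy()
        # Assumption: draw the selected feature from its empirical marginal.
        X_res[:, hidden_idx] = rng.choice(X[:, hidden_idx], size=n, replace=True)
        counts[rows, ground_truth(X_res)] += 1
    return counts / n_samples


def inject_uniform_noise(P, rate):
    """Mix each label distribution with the uniform distribution."""
    n_classes = P.shape[1]
    return (1.0 - rate) * P + rate / n_classes


def noise_level(P_clean, P_noisy):
    """Mean total variation distance between clean and noisy distributions."""
    return 0.5 * np.abs(P_clean - P_noisy).sum(axis=1).mean()


if __name__ == "__main__":
    data_rng = np.random.default_rng(0)
    X = data_rng.normal(size=(100, 3))
    P = soft_labels(X, hidden_idx=2, n_classes=3, rng=1)
    P_noisy = inject_uniform_noise(P, rate=0.2)
    print("first soft label:", P[0])
    print("injected noise level:", noise_level(P, P_noisy))
```

Because the clean label distributions are known by construction, the amount of injected noise can be measured directly rather than estimated, which is the property the abstract emphasizes. Other noise models, such as a class-dependent transition matrix, could replace `inject_uniform_noise` in the same way.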
Related papers
- Inaccurate Label Distribution Learning with Dependency Noise [52.08553913094809]
We introduce the Dependent Noise-based Inaccurate Label Distribution Learning (DN-ILDL) framework to tackle the challenges posed by noise in label distribution learning.
We show that DN-ILDL effectively addresses the ILDL problem and outperforms existing LDL methods.
arXiv Detail & Related papers (2024-05-26T07:58:07Z)
- Extracting Clean and Balanced Subset for Noisy Long-tailed Classification [66.47809135771698]
We develop a novel pseudo labeling method using class prototypes from the perspective of distribution matching.
By setting a manually specified probability measure, we can reduce the side effects of noisy and long-tailed data simultaneously.
Our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.
arXiv Detail & Related papers (2024-04-10T07:34:37Z)
- Group Benefits Instances Selection for Data Purification [21.977432359384835]
Existing methods for combating label noise are typically designed and tested on synthetic datasets.
We propose a method named GRIP to alleviate the noisy label problem for both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-03-23T03:06:19Z)
- NoisywikiHow: A Benchmark for Learning with Real-world Noisy Labels in Natural Language Processing [26.678589684142548]
Large-scale datasets in the real world inevitably involve label noise.
Deep models can gradually overfit noisy labels and thus degrade generalization performance.
To mitigate the effects of label noise, learning with noisy labels (LNL) methods are designed to achieve better generalization performance.
arXiv Detail & Related papers (2023-05-18T05:01:04Z)
- Rethinking the Value of Labels for Instance-Dependent Label Noise Learning [43.481591776038144]
Noisy labels in real-world applications often depend on both the true label and the features.
In this work, we tackle instance-dependent label noise with a novel deep generative model that avoids explicitly modeling the noise transition matrix.
Our algorithm leverages causal representation learning and simultaneously identifies the high-level content and style latent factors from the data.
arXiv Detail & Related papers (2023-05-10T15:29:07Z)
- Learning with Noisy Labels Revisited: A Study Using Real-World Human Annotations [54.400167806154535]
Existing research on learning with noisy labels mainly focuses on synthetic label noise.
This work presents two new benchmark datasets (CIFAR-10N, CIFAR-100N).
We show that real-world noisy labels follow an instance-dependent pattern rather than the classically adopted class-dependent ones.
arXiv Detail & Related papers (2021-10-22T22:42:11Z)
- Instance-dependent Label-noise Learning under a Structural Causal Model [92.76400590283448]
Label noise degrades the performance of deep learning algorithms.
By leveraging a structural causal model, we propose a novel generative approach for instance-dependent label-noise learning.
arXiv Detail & Related papers (2021-09-07T10:42:54Z)
- A Realistic Simulation Framework for Learning with Label Noise [17.14439597393087]
We show that this framework generates synthetic noisy labels that exhibit important characteristics of the label noise.
We also benchmark several existing algorithms for learning with noisy labels.
We propose a new technique, Label Quality Model (LQM), that leverages annotator features to predict and correct against noisy labels.
arXiv Detail & Related papers (2021-07-23T18:53:53Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- A Second-Order Approach to Learning with Instance-Dependent Label Noise [58.555527517928596]
The presence of label noise often misleads the training of deep neural networks.
We show that the errors in human-annotated labels are more likely to be dependent on the difficulty levels of tasks.
arXiv Detail & Related papers (2020-12-22T06:36:58Z)
- Label Noise Types and Their Effects on Deep Learning [0.0]
In this work, we provide a detailed analysis of the effects of different kinds of label noise on learning.
We propose a generic framework to generate feature-dependent label noise, which we show to be the most challenging case for learning.
To make it easier for other researchers to test their algorithms with noisy labels, we share corrupted labels for the most commonly used benchmark datasets.
arXiv Detail & Related papers (2020-03-23T18:03:39Z)