Quantity vs Quality: Investigating the Trade-Off between Sample Size and Label Reliability
- URL: http://arxiv.org/abs/2204.09462v1
- Date: Wed, 20 Apr 2022 13:52:00 GMT
- Title: Quantity vs Quality: Investigating the Trade-Off between Sample Size and Label Reliability
- Authors: Timo Bertram, Johannes Fürnkranz, Martin Müller
- Abstract summary: We study learning in probabilistic domains where the learner may receive incorrect labels but can improve the reliability of labels by repeatedly sampling them.
We motivate this problem in an application to compare the strength of poker hands where the training signal depends on the hidden community cards.
We propose two different validation strategies: switching from lower to higher levels of label validation over the course of training, and using chi-square statistics to approximate the confidence in the obtained labels.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study learning in probabilistic domains where the learner
may receive incorrect labels but can improve the reliability of labels by
repeatedly sampling them. In such a setting, one faces the question of whether a fixed budget for obtaining training examples is better spent on collecting as many different examples as possible or on improving the label quality of a smaller number of examples by re-sampling their labels. We motivate this
problem in an application to compare the strength of poker hands where the
training signal depends on the hidden community cards, and then study it in
depth in an artificial setting where we insert controlled noise levels into the
MNIST database. Our results show that with increasing noise levels, re-sampling previously seen examples becomes increasingly important relative to obtaining new examples, because classifier performance deteriorates when too many labels are incorrect. In addition, we propose two different validation strategies: switching from lower to higher levels of label validation over the course of training, and using chi-square statistics to approximate the confidence in the obtained labels.
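The two ideas in the abstract, spending a fixed labelling budget on either new examples or repeated labels, and using a chi-square statistic as a confidence proxy for re-sampled labels, can be illustrated with a short sketch. This is a minimal illustration under assumptions of my own (a uniform-noise null hypothesis, a simulated noisy oracle, and helper names such as `label_confidence` and `spend_budget`), not the authors' implementation.

```python
# Illustrative sketch only: the uniform-noise null, the draw cap, and the
# simulated oracle below are assumptions, not details from the paper.
from collections import Counter
import random

from scipy.stats import chisquare


def label_confidence(observed_labels, num_classes):
    """Chi-square goodness-of-fit of the observed label counts against a
    uniform (pure-noise) null. A high value means the counts are unlikely
    to be noise alone, so the majority label can be trusted more."""
    counts = Counter(observed_labels)
    f_obs = [counts.get(c, 0) for c in range(num_classes)]
    _, p_value = chisquare(f_obs)          # expected counts default to uniform
    return 1.0 - p_value


def spend_budget(pool, sample_label, budget, num_classes,
                 confidence=0.95, max_draws_per_example=30):
    """Toy budget allocator: keep re-sampling an example's label until the
    chi-square confidence is reached (or a cap is hit), then move on and
    spend the remaining budget on new examples."""
    dataset = []                            # (example, majority label) pairs
    for example in pool:
        observed = []
        while budget > 0 and len(observed) < max_draws_per_example:
            observed.append(sample_label(example))   # one noisy label draw
            budget -= 1
            if (len(observed) >= num_classes and
                    label_confidence(observed, num_classes) >= confidence):
                break
        if observed:
            dataset.append((example, Counter(observed).most_common(1)[0][0]))
        if budget == 0:
            break
    return dataset


if __name__ == "__main__":
    # Simulated noisy oracle: returns the true label 70% of the time.
    NUM_CLASSES = 10
    true_labels = {i: i % NUM_CLASSES for i in range(100)}

    def noisy_oracle(example):
        if random.random() < 0.7:
            return true_labels[example]
        return random.randrange(NUM_CLASSES)

    labelled = spend_budget(range(100), noisy_oracle, budget=500,
                            num_classes=NUM_CLASSES)
    acc = sum(int(true_labels[x] == y) for x, y in labelled) / len(labelled)
    print(f"labelled {len(labelled)} examples, majority-label accuracy {acc:.2f}")
```

In this toy setup, raising the confidence threshold shifts the budget from covering many examples toward validating fewer examples more thoroughly, which is the quantity-quality trade-off the paper studies on noisy MNIST.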
Related papers
- Perceptual Quality-based Model Training under Annotator Label Uncertainty [15.015925663078377]
Annotators exhibit disagreement during data labeling, which is referred to as annotator label uncertainty.
We introduce a novel perceptual quality-based model training framework to objectively generate multiple labels for model training.
arXiv Detail & Related papers (2024-03-15T10:52:18Z)
- Robust Assignment of Labels for Active Learning with Sparse and Noisy Annotations [0.17188280334580192]
Supervised classification algorithms are used to solve a growing number of real-life problems around the globe.
Unfortunately, acquiring good-quality annotations for many tasks is infeasible or too expensive to be done in practice.
We propose two novel annotation unification algorithms that utilize unlabeled parts of the sample space.
arXiv Detail & Related papers (2023-07-25T19:40:41Z)
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
- SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
- An analysis of over-sampling labeled data in semi-supervised learning with FixMatch [66.34968300128631]
Most semi-supervised learning methods over-sample labeled data when constructing training mini-batches.
This paper studies whether this common practice improves learning and how.
We compare it to an alternative setting where each mini-batch is uniformly sampled from all the training data, labeled or not; a minimal sketch contrasting the two batch-construction schemes appears after this list.
arXiv Detail & Related papers (2022-01-03T12:22:26Z)
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models.
Our proposed approach, Dash, enjoys its adaptivity in terms of unlabeled data selection.
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
- An Ensemble Noise-Robust K-fold Cross-Validation Selection Method for Noisy Labels [0.9699640804685629]
Large-scale datasets tend to contain mislabeled samples that can be memorized by deep neural networks (DNNs).
We present Ensemble Noise-robust K-fold Cross-Validation Selection (E-NKCVS) to effectively select clean samples from noisy data.
We evaluate our approach on various image and text classification tasks where the labels have been manually corrupted with different noise ratios.
arXiv Detail & Related papers (2021-07-06T02:14:52Z)
- Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces [64.23172847182109]
We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels.
We provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance.
arXiv Detail & Related papers (2021-05-12T15:40:13Z)
- Exploiting Sample Uncertainty for Domain Adaptive Person Re-Identification [137.9939571408506]
We estimate and exploit the credibility of the assigned pseudo-label of each sample to alleviate the influence of noisy labels.
Our uncertainty-guided optimization brings significant improvement and achieves the state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2020-12-16T04:09:04Z)
- Importance Reweighting for Biquality Learning [0.0]
This paper proposes an original, encompassing view of Weakly Supervised Learning.
It results in the design of generic approaches capable of dealing with any kind of label noise.
In this paper, we propose a new reweighting scheme capable of identifying non-corrupted examples in the untrusted dataset.
arXiv Detail & Related papers (2020-10-19T15:59:56Z)
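As referenced in the over-sampling entry above, the comparison it describes comes down to two mini-batch construction schemes. The following sketch is illustrative only: the function names, the 50% labeled fraction, and the toy data are assumptions for illustration, not details from that paper.

```python
# Illustrative sketch (not the paper's code) contrasting a fixed labeled
# share per mini-batch with uniform sampling over the whole pool.
import random


def oversampled_batch(labeled, unlabeled, batch_size, labeled_fraction=0.5):
    """Common SSL practice: reserve a fixed share of every mini-batch for
    labeled examples, however few of them exist overall."""
    n_labeled = int(batch_size * labeled_fraction)
    batch = random.choices(labeled, k=n_labeled)
    batch += random.choices(unlabeled, k=batch_size - n_labeled)
    return batch


def uniform_batch(labeled, unlabeled, batch_size):
    """Alternative setting: sample uniformly from the union, so labeled
    examples appear only in proportion to their overall share."""
    return random.choices(labeled + unlabeled, k=batch_size)


if __name__ == "__main__":
    labeled = [(f"x{i}", i % 10) for i in range(40)]      # 40 labeled pairs
    unlabeled = [(f"u{i}", None) for i in range(4000)]    # 4000 unlabeled

    def count_labeled(batch):
        return sum(1 for _, y in batch if y is not None)

    over = oversampled_batch(labeled, unlabeled, batch_size=64)
    unif = uniform_batch(labeled, unlabeled, batch_size=64)
    print(f"labeled per batch: {count_labeled(over)} (over-sampled) "
          f"vs {count_labeled(unif)} (uniform)")
```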
This list is automatically generated from the titles and abstracts of the papers in this site.