Promises and Pitfalls of Threshold-based Auto-labeling
- URL: http://arxiv.org/abs/2211.12620v2
- Date: Thu, 22 Feb 2024 02:47:53 GMT
- Title: Promises and Pitfalls of Threshold-based Auto-labeling
- Authors: Harit Vishwakarma, Heguang Lin, Frederic Sala, Ramya Korlakai Vinayak
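- Abstract summary: Threshold-based auto-labeling (TBAL) uses human-labeled validation data to find a confidence threshold above which data is machine-labeled.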
We derive complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data.
We validate our theoretical guarantees with extensive experiments on synthetic and real datasets.
- Score: 17.349289155257715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creating large-scale high-quality labeled datasets is a major bottleneck in
supervised machine learning workflows. Threshold-based auto-labeling (TBAL),
where validation data obtained from humans is used to find a confidence
threshold above which the data is machine-labeled, reduces reliance on manual
annotation. TBAL is emerging as a widely-used solution in practice. Given the
long shelf-life and diverse usage of the resulting datasets, understanding when
the data obtained by such auto-labeling systems can be relied on is crucial.
This is the first work to analyze TBAL systems and derive sample complexity
bounds on the amount of human-labeled validation data required for guaranteeing
the quality of machine-labeled data. Our results provide two crucial insights.
First, reasonable chunks of unlabeled data can be automatically and accurately
labeled by seemingly bad models. Second, a hidden downside of TBAL systems is
potentially prohibitive validation data usage. Together, these insights
describe the promise and pitfalls of using such systems. We validate our
theoretical guarantees with extensive experiments on synthetic and real
datasets.
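The thresholding step described in the abstract can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' algorithm: the names `find_threshold`, `auto_label`, and `max_error` are hypothetical, and the threshold is chosen from the raw empirical validation error with no confidence-interval correction, which is the part the paper's sample complexity bounds address.

```python
from typing import Dict, Hashable, Optional, Sequence


def find_threshold(confidences: Sequence[float],
                   correct: Sequence[bool],
                   max_error: float) -> Optional[float]:
    """Return the smallest threshold t such that validation points with
    confidence >= t have empirical error <= max_error, or None if no
    threshold qualifies. (Hypothetical helper, for illustration only.)"""
    # Visit validation points from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    best = None
    errors = 0
    for rank, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        # Evaluate only at the end of a run of tied confidences, so the
        # error rate covers every point with confidence >= this value.
        last_of_tie = (rank == len(order)
                       or confidences[order[rank]] != confidences[i])
        if last_of_tie and errors / rank <= max_error:
            best = confidences[i]
    return best


def auto_label(predictions: Sequence[Hashable],
               confidences: Sequence[float],
               threshold: Optional[float]) -> Dict[int, Hashable]:
    """Machine-label the points whose confidence reaches the threshold;
    the remaining points would be left for human annotation."""
    if threshold is None:
        return {}
    return {i: p
            for i, (p, c) in enumerate(zip(predictions, confidences))
            if c >= threshold}


# Toy validation set: model confidences and whether each prediction was right.
val_conf = [0.95, 0.90, 0.80, 0.60, 0.55]
val_correct = [True, True, True, False, True]
t = find_threshold(val_conf, val_correct, max_error=0.1)

# Toy unlabeled pool: only sufficiently confident points are auto-labeled.
labels = auto_label(["cat", "dog", "cat"], [0.85, 0.70, 0.80], t)
```

With so few validation points, the empirical error above the threshold is a noisy estimate of the true error, which is why the amount of validation data matters for any quality guarantee.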
Related papers
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels [6.872072177648135]
We propose a novel localization algorithm that adapts well-established ground truth estimation methods.
Our algorithm also shows superior performance during training on the TexBiG dataset.
arXiv Detail & Related papers (2023-09-18T13:08:44Z)
- Self-refining of Pseudo Labels for Music Source Separation with Noisy Labeled Data [15.275949700129797]
Music source separation (MSS) faces challenges due to the limited availability of correctly-labeled individual instrument tracks.
This paper introduces an automated technique for refining the labels in a partially mislabeled dataset.
Our proposed self-refining technique, employed with a noisy-labeled dataset, results in only a 1% accuracy degradation in multi-label instrument recognition.
arXiv Detail & Related papers (2023-07-24T07:47:21Z) - Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and
Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z) - Doubly Robust Self-Training [46.168395767948965]
We introduce doubly robust self-training, a novel semi-supervised algorithm.
We demonstrate the superiority of the doubly robust loss over the standard self-training baseline.
arXiv Detail & Related papers (2023-06-01T00:57:16Z) - A Benchmark Generative Probabilistic Model for Weak Supervised Learning [2.0257616108612373]
Weak Supervised Learning approaches have been developed to alleviate the annotation burden.
We show that probabilistic latent variable models (PLVMs) achieve state-of-the-art performance across four datasets.
arXiv Detail & Related papers (2023-03-31T07:06:24Z) - Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z) - Data Consistency for Weakly Supervised Learning [15.365232702938677]
Training machine learning models involves using large amounts of human-annotated data.
We propose a novel weak supervision algorithm that processes noisy labels, i.e., weak signals.
We show that it significantly outperforms state-of-the-art weak supervision methods on both text and image classification tasks.
arXiv Detail & Related papers (2022-02-08T16:48:19Z) - LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak
Supervision [63.08516384181491]
We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts.
Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect.
Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
arXiv Detail & Related papers (2021-11-02T15:16:08Z) - Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
Self-Tuning is a novel approach to enable data-efficient deep learning.
It unifies the exploration of labeled and unlabeled data and the transfer of a pre-trained model.
It outperforms its SSL and TL counterparts on five tasks by sharp margins.
arXiv Detail & Related papers (2021-02-25T14:56:19Z) - ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for
Semi-supervised Continual Learning [52.831894583501395]
Continual learning assumes the incoming data are fully labeled, which might not be applicable in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)