Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments
Latent Variable Estimation
- URL: http://arxiv.org/abs/2103.02761v1
- Date: Wed, 3 Mar 2021 23:52:38 GMT
- Title: Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments
Latent Variable Estimation
- Authors: Mayee F. Chen, Benjamin Cohen-Wang, Stephen Mussmann, Frederic Sala,
Christopher Ré
- Abstract summary: We use a framework centered on model misspecification in method-of-moments latent variable estimation.
We then introduce a correction that provably removes this bias in certain cases.
We observe theoretically and with synthetic experiments that for well-specified models, labeled points are worth a constant factor more than unlabeled points.
- Score: 17.212805760360954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Labeling data for modern machine learning is expensive and time-consuming.
Latent variable models can be used to infer labels from weaker,
easier-to-acquire sources operating on unlabeled data. Such models can also be
trained using labeled data, presenting a key question: should a user invest in
few labeled or many unlabeled points? We answer this via a framework centered
on model misspecification in method-of-moments latent variable estimation. Our
core result is a bias-variance decomposition of the generalization error, which
shows that the unlabeled-only approach incurs additional bias under
misspecification. We then introduce a correction that provably removes this
bias in certain cases. We apply our decomposition framework to three scenarios
-- well-specified, misspecified, and corrected models -- to 1) choose between
labeled and unlabeled data and 2) learn from their combination. We observe
theoretically and with synthetic experiments that for well-specified models,
labeled points are worth a constant factor more than unlabeled points. With
misspecification, however, their relative value is higher due to the additional
bias but can be reduced with correction. We also apply our approach to study
real-world weak supervision techniques for dataset construction.
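The labeled-vs-unlabeled comparison above can be sketched on a toy weak-supervision model. The snippet below is an illustrative setup, not the paper's code: three conditionally independent binary sources vote on a latent label, the unlabeled-only route recovers source accuracies from pairwise agreement moments (the triplet identity E[λ_i λ_j] = a_i a_j), and the labeled route estimates them directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic setup: latent binary label Y in {-1, +1} and
# three conditionally independent weak sources with accuracies
# a_i = E[lambda_i * Y] (so flip probability (1 - a_i) / 2).
n = 10_000
true_acc = np.array([0.8, 0.6, 0.7])
y = rng.choice([-1, 1], size=n)
flips = rng.random((n, 3)) < (1 - true_acc) / 2
lam = y[:, None] * np.where(flips, -1, 1)

# Unlabeled (method of moments): under conditional independence,
# E[lambda_i * lambda_j] = a_i * a_j, so each accuracy is identified
# from the three pairwise agreement moments.
M = (lam.T @ lam) / n
mom_est = np.array([
    np.sqrt(M[0, 1] * M[0, 2] / M[1, 2]),
    np.sqrt(M[0, 1] * M[1, 2] / M[0, 2]),
    np.sqrt(M[0, 2] * M[1, 2] / M[0, 1]),
])

# Labeled: estimate a_i = E[lambda_i * Y] directly from (lambda, y) pairs.
labeled_est = (lam * y[:, None]).mean(axis=0)
```

If the conditional-independence assumption is broken (correlated sources), the moment estimates pick up extra bias while the labeled estimates do not, which is the misspecification trade-off the abstract describes.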
Related papers
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and
Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
- Are labels informative in semi-supervised learning? -- Estimating and
leveraging the missing-data mechanism [4.675583319625962]
Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models.
It can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others.
We propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm.
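The debiasing idea in this summary can be illustrated on a toy example. The sketch below is an assumption-laden illustration (the propensities are taken as known, whereas the paper estimates them): when one class is labeled more often than another, a naive estimate from labeled points alone is biased, and inverse propensity weighting removes that bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative class-dependent labeling: class 1 is labeled much more
# often than class 0, so labeled points over-represent class 1.
n = 20_000
y = rng.integers(0, 2, size=n)
propensity = np.where(y == 1, 0.6, 0.2)   # Pr(labeled | class), assumed known here
labeled = rng.random(n) < propensity

# Naive class-prior estimate from labeled points only: biased upward.
naive_prior = y[labeled].mean()

# Inverse propensity weighting: weight each labeled point by
# 1 / Pr(labeled | class) to recover the true prior of 0.5.
w = 1.0 / propensity[labeled]
ipw_prior = np.sum(w * y[labeled]) / np.sum(w)
```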
arXiv Detail & Related papers (2023-02-15T09:18:46Z)
- Dist-PU: Positive-Unlabeled Learning from a Label Distribution
Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this perspective, we pursue consistency between the predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
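One simple way to encode label-distribution consistency, sketched below as an assumption rather than the paper's exact objective, is to penalize the gap between the mean predicted positive probability on a batch and the known class prior.

```python
import numpy as np

# Hypothetical sketch of a label-distribution consistency penalty:
# the batch-averaged predicted Pr(y = 1) should match the class prior.
def distribution_consistency_loss(pred_probs, class_prior):
    # pred_probs: predicted positive probabilities for a batch of examples
    return abs(float(np.mean(pred_probs)) - class_prior)
```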
arXiv Detail & Related papers (2022-12-06T07:38:29Z)
- Learning from Multiple Unlabeled Datasets with Partial Risk
Regularization [80.54710259664698]
In this paper, we aim to learn an accurate classifier without any class labels.
We first derive an unbiased estimator of the classification risk that can be estimated from the given unlabeled sets.
We then find that training with this estimator tends to overfit, as its empirical risk goes negative during training.
Experiments demonstrate that our method effectively mitigates overfitting and outperforms state-of-the-art methods for learning from multiple unlabeled sets.
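The negative-empirical-risk issue mentioned above is common to risk-rewriting estimators in general. The sketch below illustrates the generic idea, not this paper's specific regularizer: an unbiased rewriting of a partial risk can dip below zero on finite samples, and clipping that term at zero is a standard correction.

```python
import numpy as np

# Generic sketch (a PU-style rewriting, used here only for illustration):
# the negative-class risk is rewritten via unlabeled data and can go
# below zero on finite samples; clipping keeps the total risk valid.
def clipped_partial_risk(unlabeled_losses, positive_losses, prior):
    raw = np.mean(unlabeled_losses) - prior * np.mean(positive_losses)
    return max(0.0, float(raw))
```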
arXiv Detail & Related papers (2022-07-04T16:22:44Z)
- Active Learning by Feature Mixing [52.16150629234465]
We propose a novel method for batch active learning called ALFA-Mix.
We identify unlabelled instances with sufficiently-distinct features by seeking inconsistencies in predictions.
We show that inconsistencies in these predictions help discover features that the model is unable to recognise in the unlabelled instances.
arXiv Detail & Related papers (2022-03-14T12:20:54Z)
- Latent Outlier Exposure for Anomaly Detection with Contaminated Data [31.446666264334528]
Anomaly detection aims at identifying data points that show systematic deviations from the majority of data in an unlabeled dataset.
We propose a strategy for training an anomaly detector in the presence of unlabeled anomalies that is compatible with a broad class of models.
arXiv Detail & Related papers (2022-02-16T14:21:28Z)
- Learning with Proper Partial Labels [87.65718705642819]
Partial-label learning is a kind of weakly-supervised learning with inexact labels.
We show that this proper partial-label learning framework includes many previous partial-label learning settings.
We then derive a unified unbiased estimator of the classification risk.
arXiv Detail & Related papers (2021-12-23T01:37:03Z)
- Multi-class Probabilistic Bounds for Self-learning [13.875239300089861]
Pseudo-labeling is prone to error and runs the risk of adding noisy labels into unlabeled training data.
We present a probabilistic framework for analyzing self-learning in the multi-class classification scenario with partially labeled data.
arXiv Detail & Related papers (2021-09-29T13:57:37Z)
- Unbiased Loss Functions for Multilabel Classification with Missing
Labels [2.1549398927094874]
Missing labels are a ubiquitous phenomenon in extreme multi-label classification (XMC) tasks.
This paper derives the unique unbiased estimators for the different multilabel reductions.
arXiv Detail & Related papers (2021-09-23T10:39:02Z)
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models.
Our proposed approach, Dash, enjoys its adaptivity in terms of unlabeled data selection.
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
- Improving Generalization of Deep Fault Detection Models in the Presence
of Mislabeled Data [1.3535770763481902]
We propose a novel two-step framework for robust training with label noise.
In the first step, we identify outliers (including the mislabeled samples) based on the update in the hypothesis space.
In the second step, we propose different approaches to modifying the training data based on the identified outliers and a data augmentation technique.
arXiv Detail & Related papers (2020-09-30T12:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.