In all LikelihoodS: How to Reliably Select Pseudo-Labeled Data for
Self-Training in Semi-Supervised Learning
- URL: http://arxiv.org/abs/2303.01117v1
- Date: Thu, 2 Mar 2023 10:00:37 GMT
- Title: In all LikelihoodS: How to Reliably Select Pseudo-Labeled Data for
Self-Training in Semi-Supervised Learning
- Authors: Julian Rodemann, Christoph Jansen, Georg Schollmeyer, Thomas Augustin
- Abstract summary: Self-training is a simple yet effective method within semi-supervised learning.
In this paper, we aim at rendering pseudo-label selection (PLS) more robust towards the involved modeling assumptions.
Results suggest that robustness w.r.t. model choice in particular can lead to substantial accuracy gains.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-training is a simple yet effective method within semi-supervised
learning. The idea is to iteratively enhance training data by adding
pseudo-labeled data. Its generalization performance heavily depends on the
selection of these pseudo-labeled data (PLS). In this paper, we aim at
rendering PLS more robust towards the involved modeling assumptions. To this
end, we propose to select pseudo-labeled data that maximize a multi-objective
utility function. The latter is constructed to account for different sources of
uncertainty, three of which we discuss in more detail: model selection,
accumulation of errors and covariate shift. In the absence of second-order
information on such uncertainties, we furthermore consider the generic approach
of the generalized Bayesian alpha-cut updating rule for credal sets. As a
practical proof of concept, we spotlight the application of three of our robust
extensions on simulated and real-world data. Results suggest that in particular
robustness w.r.t. model choice can lead to substantial accuracy gains.
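As a rough illustration of the selection idea, the sketch below (Python, not the authors' code) pseudo-labels each unlabeled point by a consensus over several candidate models and scores it by the worst-case probability any of those models assigns to that label, so that only the most robustly supported points are added in the next self-training round. The worst-case aggregation is merely one simple way to mimic robustness w.r.t. model choice; the paper's multi-objective utility, its Bayesian decision-theoretic treatment, and the alpha-cut updating of credal sets are not reproduced here, and the concrete models, the labeled/unlabeled split, and the top-k rule are illustrative assumptions.

# Minimal sketch (not the paper's implementation): score each unlabeled point
# by the worst-case probability that a set of candidate models assigns to its
# consensus pseudo-label, then add only the most robustly supported points.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
labeled = rng.choice(len(X), size=40, replace=False)   # small labeled pool
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

# Candidate models: hedge against model choice by never committing to one.
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    GaussianNB(),
]
for m in models:
    m.fit(X[labeled], y[labeled])

# Class probabilities per model: shape (n_models, n_unlabeled, n_classes).
probs = np.stack([m.predict_proba(X[unlabeled]) for m in models])

# Pseudo-label each point with the argmax of the model-averaged prediction.
pseudo = probs.mean(axis=0).argmax(axis=1)

# Utility of a candidate point: the minimum probability any model assigns to
# its pseudo-label (worst case over the model set). High utility means the
# label is supported no matter which model happens to be the "right" one.
utility = probs[:, np.arange(len(unlabeled)), pseudo].min(axis=0)

# Add the top-k most robustly supported points in the next self-training round.
k = 10
order = np.argsort(utility)[-k:]
print("selected points:", unlabeled[order])
print("their pseudo-labels:", pseudo[order])

Replacing the minimum over models with a weighted sum would recover a standard scalarized multi-objective utility, while the minimum corresponds to a pessimistic, worst-case aggregation in the spirit of credal sets.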
Related papers
- Uncertainty-aware self-training with expectation maximization basis transformation [9.7527450662978]
We propose a new self-training framework to combine uncertainty information of both model and dataset.
Specifically, we propose to use Expectation-Maximization (EM) to smooth the labels and comprehensively estimate the uncertainty information.
arXiv Detail & Related papers (2024-05-02T11:01:31Z)
- Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning [47.02160072880698]
We introduce a self-evolving mechanism that allows the model itself to actively sample subsets that are equally or even more effective.
The key to our data sampling technique lies in the enhancement of diversity in the chosen subsets.
Extensive experiments across three datasets and benchmarks demonstrate the effectiveness of DiverseEvol.
arXiv Detail & Related papers (2023-11-14T14:10:40Z)
- Pseudo Label Selection is a Decision Problem [0.0]
Pseudo-Labeling is a simple and effective approach to semi-supervised learning.
It requires criteria that guide the selection of pseudo-labeled data.
Overfitting can be propagated to the final model by choosing instances with overconfident but wrong predictions.
arXiv Detail & Related papers (2023-09-25T07:48:02Z)
- Robust Outlier Rejection for 3D Registration with Variational Bayes [70.98659381852787]
We develop a novel variational non-local network-based outlier rejection framework for robust alignment.
We propose a voting-based inlier searching strategy to cluster the high-quality hypothetical inliers for transformation estimation.
arXiv Detail & Related papers (2023-04-04T03:48:56Z)
- Enhancing Self-Training Methods [0.0]
Semi-supervised learning approaches train on small sets of labeled data along with large sets of unlabeled data.
Self-training is a semi-supervised teacher-student approach that often suffers from the problem of "confirmation bias".
arXiv Detail & Related papers (2023-01-18T03:56:17Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data would be prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z)
- A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
- Out-distribution aware Self-training in an Open World Setting [62.19882458285749]
We leverage unlabeled data in an open world setting to further improve prediction performance.
We introduce out-distribution aware self-training, which includes a careful sample selection strategy.
Our classifiers are by design out-distribution aware and can thus distinguish task-related inputs from unrelated ones.
arXiv Detail & Related papers (2020-12-21T12:25:04Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)