How Does Pseudo-Labeling Affect the Generalization Error of the
Semi-Supervised Gibbs Algorithm?
- URL: http://arxiv.org/abs/2210.08188v2
- Date: Thu, 15 Jun 2023 17:22:45 GMT
- Title: How Does Pseudo-Labeling Affect the Generalization Error of the
Semi-Supervised Gibbs Algorithm?
- Authors: Haiyun He, Gholamali Aminian, Yuheng Bu, Miguel Rodrigues, Vincent Y.
F. Tan
- Abstract summary: We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm.
The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset.
- Score: 73.80001705134147
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We provide an exact characterization of the expected generalization error
(gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the
Gibbs algorithm. The gen-error is expressed in terms of the symmetrized KL
information between the output hypothesis, the pseudo-labeled dataset, and the
labeled dataset. Distribution-free upper and lower bounds on the gen-error can
also be obtained. Our findings offer new insights that the generalization
performance of SSL with pseudo-labeling is affected not only by the information
between the output hypothesis and input training data but also by the
information {\em shared} between the {\em labeled} and {\em pseudo-labeled}
data samples. This serves as a guideline to choose an appropriate
pseudo-labeling method from a given family of methods. To deepen our
understanding, we further explore two examples -- mean estimation and logistic
regression. In particular, we analyze how the ratio of the number of unlabeled
to labeled data $\lambda$ affects the gen-error under both scenarios. As
$\lambda$ increases, the gen-error for mean estimation decreases and then
saturates at a value larger than when all the samples are labeled, and the gap
can be quantified {\em exactly} with our analysis, and is dependent on the
\emph{cross-covariance} between the labeled and pseudo-labeled data samples.
For logistic regression, the gen-error and the variance component of the excess
risk also decrease as $\lambda$ increases.
Related papers
- Generating Unbiased Pseudo-labels via a Theoretically Guaranteed
Chebyshev Constraint to Unify Semi-supervised Classification and Regression [57.17120203327993]
threshold-to-pseudo label process (T2L) in classification uses confidence to determine the quality of label.
In nature, regression also requires unbiased methods to generate high-quality labels.
We propose a theoretically guaranteed constraint for generating unbiased labels based on Chebyshev's inequality.
arXiv Detail & Related papers (2023-11-03T08:39:35Z) - Out-Of-Domain Unlabeled Data Improves Generalization [0.7589678255312519]
We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems.
We show that unlabeled samples can be harnessed to narrow the generalization gap.
We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.
arXiv Detail & Related papers (2023-09-29T02:00:03Z) - All Points Matter: Entropy-Regularized Distribution Alignment for
Weakly-supervised 3D Segmentation [67.30502812804271]
Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning.
We propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions.
arXiv Detail & Related papers (2023-05-25T08:19:31Z) - Pseudo-Labeling for Kernel Ridge Regression under Covariate Shift [1.3597551064547502]
We learn a regression function with small mean squared error over a target distribution, based on unlabeled data from there and labeled data that may have a different feature distribution.
We propose to split the labeled data into two subsets, and conduct kernel ridge regression on them separately to obtain a collection of candidate models and an imputation model.
Our estimator achieves the minimax optimal error rate up to a polylogarithmic factor, and we find that using pseudo-labels for model selection does not significantly hinder performance.
arXiv Detail & Related papers (2023-02-20T18:46:12Z) - Dist-PU: Positive-Unlabeled Learning from a Label Distribution
Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this, we propose to pursue the label distribution consistency between predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z) - Semi-supervised Contrastive Outlier removal for Pseudo Expectation
Maximization (SCOPE) [2.33877878310217]
We present a new approach to suppress confounding errors through a method we describe as Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE)
Our results show that SCOPE greatly improves semi-supervised classification accuracy over a baseline, and furthermore when combined with consistency regularization achieves the highest reported accuracy for the semi-supervised CIFAR-10 classification task using 250 and 4000 labeled samples.
arXiv Detail & Related papers (2022-06-28T19:32:50Z) - Optimizing Diffusion Rate and Label Reliability in a Graph-Based
Semi-supervised Classifier [2.4366811507669124]
The Local and Global Consistency (LGC) algorithm is one of the most well-known graph-based semi-supervised (GSSL) classifiers.
We discuss how removing the self-influence of a labeled instance may be beneficial, and how it relates to leave-one-out error.
Within this framework, we propose methods to estimate label reliability and diffusion rate.
arXiv Detail & Related papers (2022-01-10T16:58:52Z) - Instance-Dependent Partial Label Learning [69.49681837908511]
Partial label learning is a typical weakly supervised learning problem.
Most existing approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels.
In this paper, we consider instance-dependent and assume that each example is associated with a latent label distribution constituted by the real number of each label.
arXiv Detail & Related papers (2021-10-25T12:50:26Z) - Analysis of label noise in graph-based semi-supervised learning [2.4366811507669124]
In machine learning, one must acquire labels to help supervise a model that will be able to generalize to unseen data.
It is often the case that most of our data is unlabeled.
Semi-supervised learning (SSL) alleviates that by making strong assumptions about the relation between the labels and the input data distribution.
arXiv Detail & Related papers (2020-09-27T22:13:20Z) - Self-training Avoids Using Spurious Features Under Domain Shift [54.794607791641745]
In unsupervised domain adaptation, conditional entropy minimization and pseudo-labeling work even when the domain shifts are much larger than those analyzed by existing theory.
We identify and analyze one particular setting where the domain shift can be large, but certain spurious features correlate with label in the source domain but are independent label in the target.
arXiv Detail & Related papers (2020-06-17T17:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.