Informative missingness and its implications in semi-supervised learning
- URL: http://arxiv.org/abs/2512.04392v1
- Date: Thu, 04 Dec 2025 02:26:56 GMT
- Title: Informative missingness and its implications in semi-supervised learning
- Authors: Jinran Wu, You-Gan Wang, Geoffrey J. McLachlan
- Abstract summary: Semi-supervised learning (SSL) constructs classifiers using both labelled and unlabelled data. This defines an incomplete-data problem, which statistically can be formulated within the likelihood framework for finite mixture models. Modelling such informative missingness offers a coherent statistical framework that unifies likelihood-based inference with the behaviour of empirical SSL methods.
- Score: 2.5794915063815664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semi-supervised learning (SSL) constructs classifiers using both labelled and unlabelled data. It leverages information from labelled samples, whose acquisition is often costly or labour-intensive, together with unlabelled data to enhance prediction performance. This defines an incomplete-data problem, which statistically can be formulated within the likelihood framework for finite mixture models that can be fitted using the expectation-maximisation (EM) algorithm. Ideally, one would prefer a completely labelled sample, as one would anticipate that a labelled observation provides more information than an unlabelled one. However, when the mechanism governing label absence depends on the observed features or the class labels or both, the missingness indicators themselves contain useful information. In certain situations, the information gained from modelling the missing-label mechanism can even outweigh the loss due to missing labels, yielding a classifier with a smaller expected error than one based on a completely labelled sample. This improvement arises particularly when class overlap is moderate, labelled data are sparse, and the missingness is informative. Modelling such informative missingness thus offers a coherent statistical framework that unifies likelihood-based inference with the behaviour of empirical SSL methods.
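As a rough illustration of the likelihood formulation the abstract mentions, the Python sketch below fits a two-class univariate Gaussian mixture with a common variance to partly labelled data by EM. It is not the authors' implementation. The function name `ssl_em_two_gaussians`, the `-1` coding for a missing label, and the simulated data are all assumptions made for this example. The sketch also ignores the missing-label mechanism, which is the ingredient the paper argues should be modelled when missingness is informative.

```python
import numpy as np


def ssl_em_two_gaussians(y, labels, n_iter=200, tol=1e-8):
    """EM for a two-component univariate Gaussian mixture with partly missing labels.

    y      : (n,) array of features
    labels : (n,) integer array with entries 0, 1, or -1 (label missing; hypothetical coding)
    Assumes at least one labelled observation in each class.
    Returns (pi, mu, sigma2): mixing proportions, class means, common variance.
    """
    y = np.asarray(y, dtype=float)
    labels = np.asarray(labels, dtype=int)
    missing = labels == -1
    lab_idx = labels[~missing]

    # Responsibilities: fixed one-hot rows for labelled points, 0.5 to start for unlabelled.
    tau = np.full((y.size, 2), 0.5)
    tau[~missing] = np.eye(2)[lab_idx]

    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: weighted proportions, means, and pooled variance.
        nk = tau.sum(axis=0)
        pi = nk / y.size
        mu = (tau * y[:, None]).sum(axis=0) / nk
        sigma2 = (tau * (y[:, None] - mu) ** 2).sum() / y.size

        # E-step: posterior class probabilities, updated for unlabelled points only.
        log_joint = (
            np.log(pi)
            - 0.5 * np.log(2.0 * np.pi * sigma2)
            - 0.5 * (y[:, None] - mu) ** 2 / sigma2
        )
        log_mix = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        tau[missing] = np.exp(log_joint[missing] - log_mix[missing, None])

        # Log-likelihood: labelled points contribute log(pi_k f_k(y)),
        # unlabelled points contribute the log of the mixture density.
        lj_lab = log_joint[~missing]
        ll = log_mix[missing].sum() + lj_lab[np.arange(lab_idx.size), lab_idx].sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, sigma2


# Simulated example (assumed settings): labels are removed completely at random here.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=1000)
y = rng.normal(loc=3.0 * z, scale=1.0)
labels = np.where(rng.random(1000) < 0.2, z, -1)
print(ssl_em_two_gaussians(y, labels))
```

Extending the sketch to informative missingness would mean adding a model for the probability that a label is missing and including the missingness indicators in the likelihood; the form of that model is not given in the abstract and is left out here.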
Related papers
- SSLfmm: An R Package for Semi-Supervised Learning with a Mixed-Missingness Mechanism in Finite Mixture Models [2.0253523660913664]
Semi-supervised learning (SSL) constructs classifiers from datasets in which only a subset of observations is labelled. The missingness process can be informative, as the chances of an observation being unlabelled may depend on the ambiguity of its feature vector. This package includes a practical tool for modelling and illustrates its performance through simulated examples.
arXiv Detail & Related papers (2025-12-03T00:14:33Z)
- Some Robustness Properties of Label Cleaning [6.215814187185031]
We show that learning procedures that rely on aggregated labels enjoy robustness properties impossible without data cleaning. We highlight how incorporating a fuller view of the data analysis pipeline can yield a more robust methodology by refining noisy signals.
arXiv Detail & Related papers (2025-09-14T18:17:51Z)
- Generalized Semi-Supervised Learning via Self-Supervised Feature Adaptation [87.17768598044427]
Traditional semi-supervised learning assumes that the feature distributions of labeled and unlabeled data are consistent.
We propose Self-Supervised Feature Adaptation (SSFA), a generic framework for improving SSL performance when labeled and unlabeled data come from different distributions.
Our proposed SSFA is applicable to various pseudo-label-based SSL learners and significantly improves performance in labeled, unlabeled, and even unseen distributions.
arXiv Detail & Related papers (2024-05-31T03:13:45Z)
- Virtual Category Learning: A Semi-Supervised Learning Method for Dense Prediction with Extremely Limited Labels [63.16824565919966]
This paper proposes to use confusing samples proactively without label correction.
A Virtual Category (VC) is assigned to each confusing sample in such a way that it can safely contribute to the model optimisation.
Our intriguing findings highlight the usage of VC learning in dense vision tasks.
arXiv Detail & Related papers (2023-12-02T16:23:52Z)
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
- Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism [4.675583319625962]
Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models.
It can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others.
We propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm.
arXiv Detail & Related papers (2023-02-15T09:18:46Z)
- Rethinking Precision of Pseudo Label: Test-Time Adaptation via Complementary Learning [10.396596055773012]
We propose a novel complementary learning approach to enhance test-time adaptation.
In test-time adaptation tasks, information from the source domain is typically unavailable.
We highlight that the risk function of complementary labels agrees with their Vanilla loss formula.
arXiv Detail & Related papers (2023-01-15T03:36:33Z)
- Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this, we propose to pursue the label distribution consistency between predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z)
- How Does Pseudo-Labeling Affect the Generalization Error of the Semi-Supervised Gibbs Algorithm? [73.80001705134147]
We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm.
The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset.
arXiv Detail & Related papers (2022-10-15T04:11:56Z)
- Complementing Semi-Supervised Learning with Uncertainty Quantification [6.612035830987296]
We propose a novel unsupervised uncertainty-aware objective that relies on aleatoric and epistemic uncertainty quantification.
Our results outperform the state-of-the-art results on complex datasets such as CIFAR-100 and Mini-ImageNet.
arXiv Detail & Related papers (2022-07-22T00:15:02Z)
- Exploiting Sample Uncertainty for Domain Adaptive Person Re-Identification [137.9939571408506]
We estimate and exploit the credibility of the assigned pseudo-label of each sample to alleviate the influence of noisy labels.
Our uncertainty-guided optimization brings significant improvement and achieves the state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2020-12-16T04:09:04Z)