SSLfmm: An R Package for Semi-Supervised Learning with a Mixed-Missingness Mechanism in Finite Mixture Models
- URL: http://arxiv.org/abs/2512.03322v2
- Date: Sun, 07 Dec 2025 22:56:45 GMT
- Title: SSLfmm: An R Package for Semi-Supervised Learning with a Mixed-Missingness Mechanism in Finite Mixture Models
- Authors: Geoffrey J. McLachlan, Jinran Wu
- Abstract summary: Semi-supervised learning (SSL) constructs classifiers from datasets in which only a subset of observations is labelled. The missingness process can be informative, as the chances of an observation being unlabelled may depend on the ambiguity of its feature vector. This package includes a practical tool for modelling and illustrates its performance through simulated examples.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semi-supervised learning (SSL) constructs classifiers from datasets in which only a subset of observations is labelled, a situation that naturally arises because obtaining labels often requires expert judgement or costly manual effort. This motivates methods that integrate labelled and unlabelled data within a single learning framework. Most SSL approaches assume that label absence is harmless, typically treating it as missing completely at random or ignoring it, but in practice the missingness process can be informative, as the chance of an observation being unlabelled may depend on the ambiguity of its feature vector. In such cases, the missingness indicators themselves carry additional information that, if properly modelled, may improve estimation efficiency. The SSLfmm package for R is designed to capture this behaviour by estimating the Bayes' classifier under a finite mixture model in which each component corresponding to a class follows a multivariate normal distribution. It incorporates a mixed-missingness mechanism that combines a missing completely at random (MCAR) component with a (non-ignorable) missing at random (MAR) component, the latter modelling the probability of label missingness as a logistic function of the entropy based on the features. Parameters are estimated via an Expectation-Conditional Maximisation (ECM) algorithm. In the two-class Gaussian setting with arbitrary covariance matrices, the resulting classifier trained on partially labelled data may, in some cases, achieve a lower misclassification rate than the fully supervised classifier trained with all labels known. The package includes a practical tool for modelling and illustrates its performance through simulated examples.
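The entropy-based missingness mechanism the abstract describes can be sketched numerically. The snippet below is an illustrative Python sketch only: the actual SSLfmm package is written in R, and the mixture parameters, the coefficients `rho`, `xi0`, `xi1`, and the particular way the MCAR and MAR components are combined here are assumptions for illustration, not the package's API or parameterisation. It computes the posterior class probabilities for a two-class Gaussian mixture, their Shannon entropy, and a label-missingness probability given by an MCAR floor plus a logistic function of that entropy:

```python
import numpy as np

# Illustrative two-class Gaussian mixture (not parameters from the paper)
pi = np.array([0.6, 0.4])                      # class priors
mu = np.array([[-1.0, 0.0], [1.5, 0.5]])       # class means
Sigma = np.array([np.eye(2), np.eye(2)])       # class covariances

def log_mvnorm(x, m, S):
    """Log-density of a multivariate normal at x."""
    d = x - m
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (len(m) * np.log(2 * np.pi) + logdet + d @ np.linalg.inv(S) @ d)

def posterior(x):
    """Posterior class probabilities tau(x) under the mixture."""
    logw = np.log(pi) + np.array([log_mvnorm(x, mu[k], Sigma[k]) for k in range(2)])
    w = np.exp(logw - logw.max())              # stabilise before normalising
    return w / w.sum()

def entropy(tau):
    """Shannon entropy of the posterior probabilities."""
    t = np.clip(tau, 1e-12, 1.0)
    return -np.sum(t * np.log(t))

# Hypothetical missingness coefficients: rho is the MCAR probability,
# (xi0, xi1) parameterise the logistic MAR component in the entropy.
rho, xi0, xi1 = 0.1, -2.0, 4.0

def prob_label_missing(x):
    """Probability the label of feature vector x is unobserved."""
    e = entropy(posterior(x))
    mar = 1.0 / (1.0 + np.exp(-(xi0 + xi1 * e)))
    return rho + (1.0 - rho) * mar
```

Under this sketch, a feature vector near the decision boundary has high posterior entropy and therefore a higher chance of being unlabelled than one deep inside a class, which is the informative-missingness behaviour the package is built to exploit.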
Related papers
- MLCBART: Multilabel Classification with Bayesian Additive Regression Trees [0.6117371161379209]
Multilabel Classification deals with the simultaneous classification of multiple binary labels. BART is a nonparametric and flexible model structure capable of uncovering complex relationships within the data. Our adaptation, MLCBART, assumes that labels arise from thresholding an underlying numeric scale.
arXiv Detail & Related papers (2026-01-13T20:17:45Z) - Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples [11.727747752958436]
Deep learning models often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or filtering spurious features based on some empirical assumptions. We propose a data-oriented approach to mitigate the spurious correlation in deep learning models.
arXiv Detail & Related papers (2025-12-28T10:54:51Z) - Informative missingness and its implications in semi-supervised learning [2.5794915063815664]
Semi-supervised learning (SSL) constructs classifiers using both labelled and unlabelled data. This defines an incomplete-data problem, which statistically can be formulated within the likelihood framework for finite mixture models. Modelling such informative missingness offers a coherent statistical framework that unifies likelihood-based inference with the behaviour of empirical SSL methods.
arXiv Detail & Related papers (2025-12-04T02:26:56Z) - Amortized Variational Inference for Partial-Label Learning: A Probabilistic Approach to Label Disambiguation [2.7214777196418645]
Partial-label learning trains classifiers when each instance is associated with a set of candidate labels, only one of which is correct. We introduce a novel framework that directly approximates the posterior distribution over true labels using amortized variational inference. Our method employs neural networks to predict variational parameters from input data, enabling efficient inference.
arXiv Detail & Related papers (2025-10-24T09:54:23Z) - Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels [2.384873896423002]
We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC. We empirically show that the predictive distribution's location and shape are generally correct, even in the Missing Not At Random regime.
arXiv Detail & Related papers (2025-04-25T14:31:42Z) - Probably Approximately Precision and Recall Learning [60.00180898830079]
A key challenge in machine learning is the prevalence of one-sided feedback. We introduce a Probably Approximately Correct (PAC) framework in which hypotheses are set functions that map each input to a set of labels. We develop new algorithms that learn from positive data alone, achieving optimal sample complexity in the realizable case.
arXiv Detail & Related papers (2024-11-20T04:21:07Z) - SimPro: A Simple Probabilistic Framework Towards Realistic Long-Tailed Semi-Supervised Learning [49.94607673097326]
We propose a highly adaptable framework, designated as SimPro, which does not rely on any predefined assumptions about the distribution of unlabeled data.
Our framework, grounded in a probabilistic model, innovatively refines the expectation-maximization algorithm.
Our method showcases consistent state-of-the-art performance across diverse benchmarks and data distribution scenarios.
arXiv Detail & Related papers (2024-02-21T03:39:04Z) - CLIMAX: An exploration of Classifier-Based Contrastive Explanations [5.381004207943597]
We propose a novel post-hoc model XAI technique that provides contrastive explanations justifying the classification of a black box.
Our method, which we refer to as CLIMAX, is based on local classifiers.
We show that we achieve better consistency as compared to baselines such as LIME, BayLIME, and SLIME.
arXiv Detail & Related papers (2023-07-02T22:52:58Z) - Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning in this paper.
Motivated by this, we propose to pursue the label distribution consistency between predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-06T07:38:29Z) - Analysis of Estimating the Bayes Rule for Gaussian Mixture Models with a Specified Missing-Data Mechanism [0.0]
Semi-supervised learning (SSL) approaches have been successfully applied in a wide range of engineering and scientific fields.
This paper investigates the generative model framework with a missingness mechanism for unclassified observations.
arXiv Detail & Related papers (2022-10-25T06:10:45Z) - Leveraging Instance Features for Label Aggregation in Programmatic Weak Supervision [75.1860418333995]
Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently.
The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources as labeling functions.
Existing statistical label models typically rely only on the outputs of the LFs, ignoring the instance features when modeling the underlying generative process.
arXiv Detail & Related papers (2022-10-06T07:28:53Z) - Distribution-Aware Semantics-Oriented Pseudo-label for Imbalanced Semi-Supervised Learning [80.05441565830726]
This paper addresses imbalanced semi-supervised learning, where heavily biased pseudo-labels can harm the model performance.
We propose a general pseudo-labeling framework to address the bias motivated by this observation.
We term the novel pseudo-labeling framework for imbalanced SSL as Distribution-Aware Semantics-Oriented (DASO) Pseudo-label.
arXiv Detail & Related papers (2021-06-10T11:58:25Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.