Semi-Supervised Learning with Multiple Imputations on Non-Random Missing Labels
- URL: http://arxiv.org/abs/2308.07562v1
- Date: Tue, 15 Aug 2023 04:09:53 GMT
- Title: Semi-Supervised Learning with Multiple Imputations on Non-Random Missing Labels
- Authors: Jason Lu, Michael Ma, Huaze Xu, Zixi Xu
- Abstract summary: Semi-Supervised Learning (SSL) is implemented when algorithms are trained on both labeled and unlabeled data.
This paper proposes two new methods of combining multiple imputation models to achieve higher accuracy and less bias.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Semi-Supervised Learning (SSL) is used when algorithms are trained on both labeled and unlabeled data. This is a common setting in machine learning, since obtaining a fully labeled dataset is often unrealistic. Researchers have tackled three main missingness mechanisms: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). The MNAR setting is the most challenging of the three, because one cannot safely assume that all class distributions are equal. Existing methods, including Class-Aware Imputation (CAI) and Class-Aware Propensity (CAP), mostly overlook the non-randomness in the unlabeled data. This paper proposes two new methods of combining multiple imputation models to achieve higher accuracy and less bias. 1) We use multiple imputation models, construct confidence intervals, and apply a threshold to discard pseudo-labels with low confidence. 2) Our new method, SSL with De-biased Imputations (SSL-DI), aims to reduce bias by filtering out inaccurate data and finding a subset that is accurate and reliable. This subset of the larger dataset can then be fed into another SSL model, which will be less biased. The proposed models are shown to be effective in both MCAR and MNAR settings, and experimental results show that our methodology outperforms existing methods in classification accuracy and bias reduction.
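The two steps in the abstract lend themselves to a short sketch. The following is a minimal, hypothetical illustration, not the authors' released code: K bootstrap-trained classifiers stand in for the multiple imputation models, a normal-approximation confidence interval is computed for each candidate pseudo-label, and only labels whose interval lower bound clears a threshold survive. The surviving subset is what SSL-DI would hand to a second, less biased SSL model. The classifier choice and the values of K, z, and tau are all assumptions.

```python
# Hypothetical sketch of the paper's two steps; not the authors' code.
# Multiple imputation models -> per-sample confidence intervals -> threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

def debiased_pseudo_label_subset(X_lab, y_lab, X_unl, K=10, z=1.96, tau=0.7, seed=0):
    """Return the reliable (X, pseudo-label) subset described in the abstract."""
    rng = np.random.default_rng(seed)
    n_classes = int(y_lab.max()) + 1          # assumes integer labels 0..C-1
    all_probs = []
    for _ in range(K):                        # K imputation models (assumption)
        idx = rng.choice(len(X_lab), size=len(X_lab), replace=True)  # bootstrap
        clf = LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx])
        probs = np.zeros((len(X_unl), n_classes))
        probs[:, clf.classes_] = clf.predict_proba(X_unl)  # align class columns
        all_probs.append(probs)
    all_probs = np.stack(all_probs)           # (K, n_unlabeled, n_classes)
    mean, std = all_probs.mean(axis=0), all_probs.std(axis=0, ddof=1)
    pseudo = mean.argmax(axis=1)              # candidate pseudo-labels
    rows = np.arange(len(X_unl))
    lower = mean[rows, pseudo] - z * std[rows, pseudo] / np.sqrt(K)  # CI lower bound
    keep = lower > tau                        # ignore low-confidence pseudo-labels
    return X_unl[keep], pseudo[keep]          # subset for the second SSL model
```

In SSL-DI terms, the returned subset would then be treated as (pseudo-)labeled data when training the second model.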
Related papers
- Boosting Semi-Supervised Learning by bridging high and low-confidence predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL).
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training; a toy version of this bridging idea is sketched below.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
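Plain FixMatch-style training discards pseudo-labels below a confidence threshold; the bridging idea is to also recycle low-confidence predictions. The weighting scheme and constants below are illustrative assumptions, not ReFixMatch's actual loss.

```python
# Toy sketch of bridging high- and low-confidence predictions.
import torch
import torch.nn.functional as F

def bridged_pseudo_label_loss(logits_weak, logits_strong, tau=0.95, low_weight=0.1):
    probs = torch.softmax(logits_weak.detach(), dim=1)   # teacher predictions
    conf, pseudo = probs.max(dim=1)                      # confidence and pseudo-label
    per_example = F.cross_entropy(logits_strong, pseudo, reduction="none")
    # High-confidence examples get full weight; low-confidence ones are kept
    # at a reduced weight instead of being discarded outright.
    weight = torch.where(conf >= tau,
                         torch.ones_like(conf),
                         torch.full_like(conf, low_weight))
    return (weight * per_example).mean()
```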
- Adaptive Negative Evidential Deep Learning for Open-set Semi-supervised Learning [69.81438976273866]
Open-set semi-supervised learning (open-set SSL) considers a more practical scenario in which unlabeled data and test data contain new categories (outliers) not observed in the labeled data (inliers).
We introduce evidential deep learning (EDL) as an outlier detector to quantify different types of uncertainty, and design different uncertainty metrics for self-training and inference; the standard EDL bookkeeping is sketched below.
We propose a novel adaptive negative optimization strategy, making EDL better tailored to unlabeled datasets containing both inliers and outliers.
arXiv Detail & Related papers (2023-03-21T09:07:15Z)
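For readers unfamiliar with EDL, the standard Dirichlet-based uncertainty computation (following Sensoy et al., 2018) looks roughly like the sketch below; the paper's adaptive negative optimization is not reproduced here.

```python
# Standard evidential deep learning bookkeeping, shown only to make the
# summary concrete.
import torch

def edl_probs_and_uncertainty(logits):
    evidence = torch.relu(logits)              # non-negative evidence per class
    alpha = evidence + 1.0                     # Dirichlet concentration parameters
    strength = alpha.sum(dim=1, keepdim=True)  # Dirichlet strength S
    probs = alpha / strength                   # expected class probabilities
    k = logits.shape[1]                        # number of classes
    uncertainty = k / strength.squeeze(1)      # vacuity u = K / S; high for outliers
    return probs, uncertainty
```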
- Land Cover and Land Use Detection using Semi-Supervised Learning [0.0]
We create "artificial" labels and train a model to achieve reasonable accuracy.
We use a variety of class-imbalanced satellite image datasets: EuroSAT, UCM, and WHU-RS19.
Our approach significantly lessens the requirement for labeled data, consistently outperforms alternative approaches, and resolves the issue of model bias caused by class imbalance in datasets.
arXiv Detail & Related papers (2022-12-21T17:36:28Z)
- Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models [0.0]
Misleading or unnecessary data can have out-sized impacts on the health or accuracy of Machine Learning (ML) models.
We present a sequential selection method that identifies critically important information within a dataset.
We find these instabilities are a result of the complexity of the underlying map and linked to extreme events and heavy tails.
arXiv Detail & Related papers (2022-08-27T19:43:53Z)
- On Non-Random Missing Labels in Semi-Supervised Learning [114.62655062520425]
Semi-Supervised Learning (SSL) is fundamentally a missing-label problem.
We explicitly incorporate "class" into SSL; a per-class thresholding fragment in this spirit is sketched below.
Our method not only significantly outperforms existing baselines but also surpasses other label-bias-removal SSL methods.
arXiv Detail & Related papers (2022-06-29T22:01:29Z)
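The "class-aware" ingredient can be made concrete with a tiny sketch: instead of one global confidence cutoff, each class gets its own threshold. This is only a hypothetical fragment in the spirit of class-aware imputation, not the paper's actual CAP/CAI implementation; the threshold values are assumptions the user would choose or estimate.

```python
# Hypothetical class-aware pseudo-label filtering.
import torch

def class_aware_pseudo_label_mask(probs, thresholds):
    # probs: (N, C) softmax outputs; thresholds: (C,) per-class cutoffs.
    conf, pseudo = probs.max(dim=1)
    keep = conf >= thresholds[pseudo]   # each example judged by its class's bar
    return keep, pseudo
```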
- Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV), a learning framework for any dataset with abundant unlabeled data but very few labeled examples; a compression-based stand-in is sketched below.
We show that NPC-LV outperforms supervised methods on image classification across all three datasets in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z)
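NPC-LV compresses with a trained latent variable model; as a rough stand-in, the sketch below uses gzip to compute the normalized compression distance (NCD) with a 1-nearest-neighbour vote. This conveys the "learning by compression" idea without any generative model; gzip here is purely an assumption, not the paper's compressor.

```python
# Compression-based non-parametric classification with gzip as a stand-in.
import gzip

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = len(gzip.compress(x)), len(gzip.compress(y))
    cxy = len(gzip.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(sample: bytes, labeled: list) -> int:
    # labeled: list of (bytes, label) pairs; no parameters are trained.
    return min(labeled, key=lambda pair: ncd(sample, pair[0]))[1]
```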
- BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets [14.739359755029353]
Current semi-supervised learning (SSL) methods assume a balance between the number of data points available for each class in both the labeled and the unlabeled data sets.
We propose BASIL, a novel algorithm that optimizes submodular mutual information (SMI) functions in a per-class fashion to gradually select a balanced dataset in an active learning loop; a generic greedy-selection sketch follows.
arXiv Detail & Related papers (2022-03-10T21:34:08Z)
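BASIL's per-class SMI optimization is richer than this, but greedy selection under a submodular objective has a compact generic form. The facility-location objective below is a stand-in for the paper's SMI functions; the similarity matrix and budget are inputs the caller must supply.

```python
# Generic greedy maximization of a facility-location objective.
import numpy as np

def greedy_facility_location(sim, budget):
    # sim: (n, n) pairwise similarity matrix over unlabeled points.
    selected, covered = [], np.zeros(sim.shape[0])
    for _ in range(budget):
        # Marginal gain of each candidate: improvement in total coverage.
        gains = np.maximum(sim, covered[None, :]).sum(axis=1) - covered.sum()
        gains[selected] = -np.inf            # never pick the same point twice
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected
```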
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models.
Our proposed approach, Dash, adaptively selects which unlabeled examples to train on via a dynamic threshold; an illustrative version of the rule is sketched below.
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
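Dash's selection rule can be sketched in a few lines: keep unlabeled examples whose loss against their pseudo-label falls below a threshold that decays over training. The decay schedule and constants below are assumptions, not the paper's derived values.

```python
# Illustrative dynamic-thresholding selection rule.
import torch
import torch.nn.functional as F

def dash_select(logits, pseudo_labels, step, rho_hat, gamma=1.3):
    losses = F.cross_entropy(logits, pseudo_labels, reduction="none")
    rho_t = rho_hat * gamma ** (-step)   # threshold shrinks as training proceeds
    return losses < rho_t                # mask of examples to train on
```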
- OpenMatch: Open-set Consistency Regularization for Semi-supervised Learning with Outliers [71.08167292329028]
We propose a novel open-set semi-supervised learning (OSSL) approach called OpenMatch.
OpenMatch unifies FixMatch with novelty detection based on one-vs-all (OVA) classifiers; the OVA outlier-scoring rule is sketched below.
It achieves state-of-the-art performance on three datasets, and even outperforms a fully supervised model in detecting outliers unseen in unlabeled data on CIFAR10.
arXiv Detail & Related papers (2021-05-28T23:57:15Z)
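A sketch of OVA-based outlier scoring, under the assumption that each class has a (negative, positive) logit pair as is typical for one-vs-all heads: an example is flagged when even its predicted class's own head considers it negative.

```python
# OVA outlier scoring sketch; the (N, 2, C) logit layout is an assumption.
import torch

def ova_outlier_score(ova_logits, closed_set_pred):
    # ova_logits: (N, 2, C); closed_set_pred: (N,) predicted inlier classes.
    p = torch.softmax(ova_logits, dim=1)   # per-class binary probabilities
    rows = torch.arange(ova_logits.shape[0])
    p_neg = p[rows, 0, closed_set_pred]    # negative prob of the predicted class
    return p_neg                           # score > 0.5 suggests an outlier
```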
- PLM: Partial Label Masking for Imbalanced Multi-label Classification [59.68444804243782]
Neural networks trained on real-world datasets with long-tailed label distributions are biased towards frequent classes and perform poorly on infrequent classes.
We propose a method, Partial Label Masking (PLM), which utilizes each class's ratio of positive to negative labels during training; a masking sketch follows.
Our method achieves strong performance when compared to existing methods on both multi-label (MultiMNIST and MSCOCO) and single-label (imbalanced CIFAR-10 and CIFAR-100) image classification datasets.
arXiv Detail & Related papers (2021-05-22T18:07:56Z)
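PLM's masking can be sketched as class-dependent Bernoulli keep-probabilities applied to positive and negative targets. The probabilities here are hypothetical inputs a user would derive from each class's positive-to-negative ratio; they are not the paper's exact update rule.

```python
# Sketch of partial label masking for a multi-hot target matrix.
import torch

def plm_loss_mask(targets, keep_prob_pos, keep_prob_neg):
    # targets: (N, C) multi-hot labels; keep_prob_*: (C,) per-class probabilities.
    keep = torch.where(targets.bool(), keep_prob_pos, keep_prob_neg)
    mask = (torch.rand_like(keep) < keep).float()
    return mask   # multiply elementwise into the per-class BCE loss
```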
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.