Semi-Supervised Learning with Multiple Imputations on Non-Random Missing Labels
- URL: http://arxiv.org/abs/2308.07562v1
- Date: Tue, 15 Aug 2023 04:09:53 GMT
- Title: Semi-Supervised Learning with Multiple Imputations on Non-Random Missing Labels
- Authors: Jason Lu, Michael Ma, Huaze Xu, Zixi Xu
- Abstract summary: Semi-Supervised Learning (SSL) is implemented when algorithms are trained on both labeled and unlabeled data.
This paper proposes two new methods of combining multiple imputation models to achieve higher accuracy and less bias.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Semi-Supervised Learning (SSL) is used when algorithms are trained on both labeled and unlabeled data. This is a common setting in machine learning, since obtaining a fully labeled dataset is often unrealistic. Researchers have tackled three main missingness mechanisms: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). The MNAR setting is the most challenging of the three, because one cannot safely assume that all class distributions are equal. Existing methods, including Class-Aware Imputation (CAI) and Class-Aware Propensity (CAP), mostly overlook the non-randomness in the unlabeled data. This paper proposes two new methods of combining multiple imputation models to achieve higher accuracy and less bias. 1) We use multiple imputation models, construct confidence intervals, and apply a threshold to discard pseudo-labels with low confidence. 2) Our new method, SSL with De-biased Imputations (SSL-DI), aims to reduce bias by filtering out inaccurate data and finding a subset that is accurate and reliable. This subset of the larger dataset can then be fed into another SSL model, which will be less biased. The proposed models are shown to be effective in both MCAR and MNAR settings, and experimental results show that our methodology outperforms existing methods in classification accuracy and bias reduction.
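The two steps in the abstract lend themselves to a short sketch. The following is a minimal, hypothetical illustration, not the authors' released code: K bootstrap-trained classifiers stand in for the multiple imputation models, a normal-approximation confidence interval is computed for each candidate pseudo-label, and only labels whose interval lower bound clears a threshold survive. The surviving subset is what SSL-DI would hand to a second, less biased SSL model. The classifier choice and the values of K, z, and tau are all assumptions.

```python
# Hypothetical sketch of the paper's two steps; not the authors' code.
# Multiple imputation models -> per-sample confidence intervals -> threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

def debiased_pseudo_label_subset(X_lab, y_lab, X_unl, K=10, z=1.96, tau=0.7, seed=0):
    """Return the reliable (X, pseudo-label) subset described in the abstract."""
    rng = np.random.default_rng(seed)
    n_classes = int(y_lab.max()) + 1          # assumes integer labels 0..C-1
    all_probs = []
    for _ in range(K):                        # K imputation models (assumption)
        idx = rng.choice(len(X_lab), size=len(X_lab), replace=True)  # bootstrap
        clf = LogisticRegression(max_iter=1000).fit(X_lab[idx], y_lab[idx])
        probs = np.zeros((len(X_unl), n_classes))
        probs[:, clf.classes_] = clf.predict_proba(X_unl)  # align class columns
        all_probs.append(probs)
    all_probs = np.stack(all_probs)           # (K, n_unlabeled, n_classes)
    mean, std = all_probs.mean(axis=0), all_probs.std(axis=0, ddof=1)
    pseudo = mean.argmax(axis=1)              # candidate pseudo-labels
    rows = np.arange(len(X_unl))
    lower = mean[rows, pseudo] - z * std[rows, pseudo] / np.sqrt(K)  # CI lower bound
    keep = lower > tau                        # ignore low-confidence pseudo-labels
    return X_unl[keep], pseudo[keep]          # subset for the second SSL model
```

In SSL-DI terms, the returned subset would then be treated as (pseudo-)labeled data when training the second model.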
Related papers
- Boosting Semi-Supervised Learning by bridging high and low-confidence predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL).
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training; a toy version of this bridging idea is sketched below.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
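Plain FixMatch-style training discards pseudo-labels below a confidence threshold; the bridging idea is to also recycle low-confidence predictions. The weighting scheme and constants below are illustrative assumptions, not ReFixMatch's actual loss.

```python
# Toy sketch of bridging high- and low-confidence predictions.
import torch
import torch.nn.functional as F

def bridged_pseudo_label_loss(logits_weak, logits_strong, tau=0.95, low_weight=0.1):
    probs = torch.softmax(logits_weak.detach(), dim=1)   # teacher predictions
    conf, pseudo = probs.max(dim=1)                      # confidence and pseudo-label
    per_example = F.cross_entropy(logits_strong, pseudo, reduction="none")
    # High-confidence examples get full weight; low-confidence ones are kept
    # at a reduced weight instead of being discarded outright.
    weight = torch.where(conf >= tau,
                         torch.ones_like(conf),
                         torch.full_like(conf, low_weight))
    return (weight * per_example).mean()
```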
- Adaptive Negative Evidential Deep Learning for Open-set Semi-supervised Learning [69.81438976273866]
Open-set semi-supervised learning (open-set SSL) considers a more practical scenario in which unlabeled data and test data contain new categories (outliers) not observed in the labeled data (inliers).
We introduce evidential deep learning (EDL) as an outlier detector to quantify different types of uncertainty, and design different uncertainty metrics for self-training and inference; the standard EDL bookkeeping is sketched below.
We propose a novel adaptive negative optimization strategy, making EDL better tailored to unlabeled datasets containing both inliers and outliers.
arXiv Detail & Related papers (2023-03-21T09:07:15Z)
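For readers unfamiliar with EDL, the standard Dirichlet-based uncertainty computation (following Sensoy et al., 2018) looks roughly like the sketch below; the paper's adaptive negative optimization is not reproduced here.

```python
# Standard evidential deep learning bookkeeping, shown only to make the
# summary concrete.
import torch

def edl_probs_and_uncertainty(logits):
    evidence = torch.relu(logits)              # non-negative evidence per class
    alpha = evidence + 1.0                     # Dirichlet concentration parameters
    strength = alpha.sum(dim=1, keepdim=True)  # Dirichlet strength S
    probs = alpha / strength                   # expected class probabilities
    k = logits.shape[1]                        # number of classes
    uncertainty = k / strength.squeeze(1)      # vacuity u = K / S; high for outliers
    return probs, uncertainty
```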
- Land Cover and Land Use Detection using Semi-Supervised Learning [0.0]
We create "artificial" labels and train a model to achieve reasonable accuracy.
We use a variety of class-imbalanced satellite image datasets: EuroSAT, UCM, and WHU-RS19.
Our approach significantly lessens the requirement for labeled data, consistently outperforms alternative approaches, and resolves the issue of model bias caused by class imbalance in datasets.
arXiv Detail & Related papers (2022-12-21T17:36:28Z)
- Information FOMO: The unhealthy fear of missing out on information. A method for removing misleading data for healthier models [0.0]
Misleading or unnecessary data can have out-sized impacts on the health or accuracy of Machine Learning (ML) models.
We present a sequential selection method that identifies critically important information within a dataset.
We find these instabilities are a result of the complexity of the underlying map and linked to extreme events and heavy tails.
arXiv Detail & Related papers (2022-08-27T19:43:53Z)
- On Non-Random Missing Labels in Semi-Supervised Learning [114.62655062520425]
Semi-Supervised Learning (SSL) is fundamentally a missing-label problem.
We explicitly incorporate "class" into SSL; a per-class thresholding fragment in this spirit is sketched below.
Our method not only significantly outperforms existing baselines but also surpasses other label-bias-removal SSL methods.
arXiv Detail & Related papers (2022-06-29T22:01:29Z)
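The "class-aware" ingredient can be made concrete with a tiny sketch: instead of one global confidence cutoff, each class gets its own threshold. This is only a hypothetical fragment in the spirit of class-aware imputation, not the paper's actual CAP/CAI implementation; the threshold values are assumptions the user would choose or estimate.

```python
# Hypothetical class-aware pseudo-label filtering.
import torch

def class_aware_pseudo_label_mask(probs, thresholds):
    # probs: (N, C) softmax outputs; thresholds: (C,) per-class cutoffs.
    conf, pseudo = probs.max(dim=1)
    keep = conf >= thresholds[pseudo]   # each example judged by its class's bar
    return keep, pseudo
```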
- Few-Shot Non-Parametric Learning with Deep Latent Variable Model [50.746273235463754]
We propose Non-Parametric learning by Compression with Latent Variables (NPC-LV), a learning framework for any dataset with abundant unlabeled data but very few labeled examples; a compression-based stand-in is sketched below.
We show that NPC-LV outperforms supervised methods on image classification across all three datasets in the low-data regime.
arXiv Detail & Related papers (2022-06-23T09:35:03Z)
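NPC-LV compresses with a trained latent variable model; as a rough stand-in, the sketch below uses gzip to compute the normalized compression distance (NCD) with a 1-nearest-neighbour vote. This conveys the "learning by compression" idea without any generative model; gzip here is purely an assumption, not the paper's compressor.

```python
# Compression-based non-parametric classification with gzip as a stand-in.
import gzip

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = len(gzip.compress(x)), len(gzip.compress(y))
    cxy = len(gzip.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(sample: bytes, labeled: list) -> int:
    # labeled: list of (bytes, label) pairs; no parameters are trained.
    return min(labeled, key=lambda pair: ncd(sample, pair[0]))[1]
```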
- BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets [14.739359755029353]
Current semi-supervised learning (SSL) methods assume a balance between the number of data points available for each class in both the labeled and the unlabeled data sets.
We propose BASIL, a novel algorithm that optimizes submodular mutual information (SMI) functions in a per-class fashion to gradually select a balanced dataset in an active learning loop; a generic greedy-selection sketch follows.
arXiv Detail & Related papers (2022-03-10T21:34:08Z)
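BASIL's per-class SMI optimization is richer than this, but greedy selection under a submodular objective has a compact generic form. The facility-location objective below is a stand-in for the paper's SMI functions; the similarity matrix and budget are inputs the caller must supply.

```python
# Generic greedy maximization of a facility-location objective.
import numpy as np

def greedy_facility_location(sim, budget):
    # sim: (n, n) pairwise similarity matrix over unlabeled points.
    selected, covered = [], np.zeros(sim.shape[0])
    for _ in range(budget):
        # Marginal gain of each candidate: improvement in total coverage.
        gains = np.maximum(sim, covered[None, :]).sum(axis=1) - covered.sum()
        gains[selected] = -np.inf            # never pick the same point twice
        best = int(np.argmax(gains))
        selected.append(best)
        covered = np.maximum(covered, sim[best])
    return selected
```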
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that uses unlabeled examples to train models.
Our proposed approach, Dash, adaptively selects which unlabeled examples to train on via a dynamic threshold; an illustrative version of the rule is sketched below.
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
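Dash's selection rule can be sketched in a few lines: keep unlabeled examples whose loss against their pseudo-label falls below a threshold that decays over training. The decay schedule and constants below are assumptions, not the paper's derived values.

```python
# Illustrative dynamic-thresholding selection rule.
import torch
import torch.nn.functional as F

def dash_select(logits, pseudo_labels, step, rho_hat, gamma=1.3):
    losses = F.cross_entropy(logits, pseudo_labels, reduction="none")
    rho_t = rho_hat * gamma ** (-step)   # threshold shrinks as training proceeds
    return losses < rho_t                # mask of examples to train on
```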
- OpenMatch: Open-set Consistency Regularization for Semi-supervised Learning with Outliers [71.08167292329028]
We propose a novel open-set semi-supervised learning (OSSL) approach called OpenMatch.
OpenMatch unifies FixMatch with novelty detection based on one-vs-all (OVA) classifiers; the OVA outlier-scoring rule is sketched below.
It achieves state-of-the-art performance on three datasets, and even outperforms a fully supervised model in detecting outliers unseen in unlabeled data on CIFAR10.
arXiv Detail & Related papers (2021-05-28T23:57:15Z)
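A sketch of OVA-based outlier scoring, under the assumption that each class has a (negative, positive) logit pair as is typical for one-vs-all heads: an example is flagged when even its predicted class's own head considers it negative.

```python
# OVA outlier scoring sketch; the (N, 2, C) logit layout is an assumption.
import torch

def ova_outlier_score(ova_logits, closed_set_pred):
    # ova_logits: (N, 2, C); closed_set_pred: (N,) predicted inlier classes.
    p = torch.softmax(ova_logits, dim=1)   # per-class binary probabilities
    rows = torch.arange(ova_logits.shape[0])
    p_neg = p[rows, 0, closed_set_pred]    # negative prob of the predicted class
    return p_neg                           # score > 0.5 suggests an outlier
```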
- PLM: Partial Label Masking for Imbalanced Multi-label Classification [59.68444804243782]
Neural networks trained on real-world datasets with long-tailed label distributions are biased towards frequent classes and perform poorly on infrequent classes.
We propose a method, Partial Label Masking (PLM), which utilizes each class's ratio of positive to negative labels during training; a masking sketch follows.
Our method achieves strong performance when compared to existing methods on both multi-label (MultiMNIST and MSCOCO) and single-label (imbalanced CIFAR-10 and CIFAR-100) image classification datasets.
arXiv Detail & Related papers (2021-05-22T18:07:56Z)
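PLM's masking can be sketched as class-dependent Bernoulli keep-probabilities applied to positive and negative targets. The probabilities here are hypothetical inputs a user would derive from each class's positive-to-negative ratio; they are not the paper's exact update rule.

```python
# Sketch of partial label masking for a multi-hot target matrix.
import torch

def plm_loss_mask(targets, keep_prob_pos, keep_prob_neg):
    # targets: (N, C) multi-hot labels; keep_prob_*: (C,) per-class probabilities.
    keep = torch.where(targets.bool(), keep_prob_pos, keep_prob_neg)
    mask = (torch.rand_like(keep) < keep).float()
    return mask   # multiply elementwise into the per-class BCE loss
```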
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.