Are labels informative in semi-supervised learning? -- Estimating and
leveraging the missing-data mechanism
- URL: http://arxiv.org/abs/2302.07540v1
- Date: Wed, 15 Feb 2023 09:18:46 GMT
- Title: Are labels informative in semi-supervised learning? -- Estimating and
leveraging the missing-data mechanism
- Authors: Aude Sportisse (CRISAM, 3iA Côte d'Azur, MAASAI, UCA), Hugo Schmutz
  (CRISAM, TIRO-MATOs, JAD, 3iA Côte d'Azur, MAASAI, UCA), Olivier Humbert
  (UNICANCER/CAL, TIRO-MATOs, UCA), Charles Bouveyron (MAASAI, CRISAM, 3iA
  Côte d'Azur, UCA), Pierre-Alexandre Mattei (MAASAI, CRISAM, 3iA Côte
  d'Azur, UCA)
- Abstract summary: Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models.
It can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others.
We propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm.
- Score: 4.675583319625962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semi-supervised learning is a powerful technique for leveraging unlabeled
data to improve machine learning models, but it can be affected by the presence
of "informative" labels, which occur when some classes are more likely to be
labeled than others. In the missing data literature, such labels are called
missing not at random. In this paper, we propose a novel approach to address
this issue by estimating the missing-data mechanism and using inverse
propensity weighting to debias any SSL algorithm, including those using data
augmentation. We also propose a likelihood ratio test to assess whether or not
labels are indeed informative. Finally, we demonstrate the performance of the
proposed methods on different datasets, in particular on two medical datasets
for which we design pseudo-realistic missing data scenarios.
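As a rough illustration of the inverse-propensity-weighting idea described in the abstract (a minimal sketch, not the authors' implementation), the snippet below reweights the labeled part of an SSL loss by the inverse of class-conditional labeling propensities. The function name `ipw_labeled_loss` and the assumption that the propensities `pi` are already estimated are ours; the paper estimates the missing-data mechanism itself, which is not reproduced here.

```python
# Minimal sketch of inverse-propensity-weighted (IPW) debiasing for the labeled
# part of an SSL loss, assuming the class-conditional labeling propensities
# `pi` (probability that an example of class k gets labeled) are already known
# or estimated. Hypothetical helper, not the paper's code.
import numpy as np

def ipw_labeled_loss(log_probs, labels, pi):
    """Cross-entropy on labeled data, reweighted by 1/pi[y] to debias MNAR labels.

    log_probs: (n, K) array of model log-probabilities for the labeled examples.
    labels:    (n,) array of integer class labels in {0, ..., K-1}.
    pi:        (K,) array of class-conditional labeling propensities, pi[k] > 0.
    """
    nll = -log_probs[np.arange(len(labels)), labels]  # per-example cross-entropy
    weights = 1.0 / pi[labels]                        # inverse-propensity weights
    return np.mean(weights * nll)
```

Classes that are under-labeled (small pi[k]) receive proportionally larger weights, so the reweighted labeled risk is an unbiased estimate of the risk one would get under uniform labeling; the same weighting can in principle be applied to the labeled term of any SSL objective, including ones using data augmentation.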
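The likelihood ratio test for informative labels can likewise be sketched in a simplified form. Assuming (our assumption, standing in for the paper's actual test) that the marginal class proportions over labeled plus unlabeled data are available, e.g. from model predictions, a G-test compares the labeled class frequencies against those proportions: under non-informative (MCAR) labeling the two should agree.

```python
# Simplified likelihood-ratio (G-)test for informative labels: a stand-in for
# the paper's test, not its implementation. Assumes the marginal class
# proportions `p_full` over labeled + unlabeled data are available.
import numpy as np
from scipy.stats import chi2

def informative_labels_lrt(labeled_counts, p_full):
    """G-test of H0: labeled class frequencies match the full-data proportions."""
    n_k = np.asarray(labeled_counts, dtype=float)
    expected = n_k.sum() * np.asarray(p_full)
    mask = n_k > 0                                    # 0 * log(0) = 0 convention
    stat = 2.0 * np.sum(n_k[mask] * np.log(n_k[mask] / expected[mask]))
    dof = len(n_k) - 1
    return stat, chi2.sf(stat, dof)                   # statistic and p-value
```

For instance, with `labeled_counts = [80, 20]` and `p_full = [0.5, 0.5]`, the statistic is about 38.5 on 1 degree of freedom and the p-value is essentially zero, flagging the labels as informative.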
Related papers
- Deep Active Learning with Manifold-preserving Trajectory Sampling [2.0717982775472206]
Active learning (AL) optimizes the selection of unlabeled data for annotation (labeling).
Existing deep AL methods arguably suffer from bias incurred by labeled data, which accounts for a much smaller fraction of the data than unlabeled data in the AL context.
We propose a novel method, namely Manifold-Preserving Trajectory Sampling (MPTS), aiming to enforce the feature space learned from labeled data to represent a more accurate manifold.
arXiv Detail & Related papers (2024-10-21T03:04:09Z)
- FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness for Semi-Supervised Learning [73.13448439554497]
Semi-Supervised Learning (SSL) has been an effective way to leverage abundant unlabeled data with extremely scarce labeled data.
Most SSL methods are commonly based on instance-wise consistency between different data transformations.
We propose FlatMatch, which minimizes a cross-sharpness measure to ensure consistent learning performance between the labeled and unlabeled datasets.
arXiv Detail & Related papers (2023-10-25T06:57:59Z)
- Soft Curriculum for Learning Conditional GANs with Noisy-Labeled and Uncurated Unlabeled Data [70.25049762295193]
We introduce a novel conditional image generation framework that accepts noisy-labeled and uncurated data during training.
We propose soft curriculum learning, which assigns instance-wise weights for adversarial training while assigning new labels for unlabeled data.
Our experiments show that our approach outperforms existing semi-supervised and label-noise robust methods in terms of both quantitative and qualitative performance.
arXiv Detail & Related papers (2023-07-17T08:31:59Z)
- Adaptive Negative Evidential Deep Learning for Open-set Semi-supervised Learning [69.81438976273866]
Open-set semi-supervised learning (Open-set SSL) considers a more practical scenario, where unlabeled data and test data contain new categories (outliers) not observed in labeled data (inliers).
We introduce evidential deep learning (EDL) as an outlier detector to quantify different types of uncertainty, and design different uncertainty metrics for self-training and inference.
We propose a novel adaptive negative optimization strategy, making EDL more tailored to the unlabeled dataset containing both inliers and outliers.
arXiv Detail & Related papers (2023-03-21T09:07:15Z)
- Weighted Distillation with Unlabeled Examples [15.825078347452024]
Distillation with unlabeled examples is a popular and powerful method for training deep neural networks in settings where the amount of labeled data is limited.
This paper proposes a principled approach for addressing this issue based on a "debiasing" reweighting of the student's loss function tailored to the distillation training paradigm.
arXiv Detail & Related papers (2022-10-13T04:08:56Z)
- AuxMix: Semi-Supervised Learning with Unconstrained Unlabeled Data [6.633920993895286]
We show that state-of-the-art SSL algorithms suffer a degradation in performance in the presence of unlabeled auxiliary data.
We propose AuxMix, an algorithm that leverages self-supervised learning tasks to learn generic features in order to mask auxiliary data that are not semantically similar to the labeled set.
arXiv Detail & Related papers (2022-06-14T16:25:20Z)
- OpenCoS: Contrastive Semi-supervised Learning for Handling Open-set Unlabeled Data [65.19205979542305]
Unlabeled data may include out-of-class samples in practice.
OpenCoS is a method for handling this realistic semi-supervised learning scenario.
arXiv Detail & Related papers (2021-06-29T06:10:05Z)
- A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams [10.370629574634092]
This survey pays special attention to methods that leverage unlabelled data in a semi-supervised setting.
We discuss the delayed labelling issue, which impacts both fully supervised and semi-supervised methods.
arXiv Detail & Related papers (2021-06-16T23:14:20Z)
- A Novel Perspective for Positive-Unlabeled Learning via Noisy Labels [49.990938653249415]
This research presents a methodology that assigns initial pseudo-labels to unlabeled data, treats them as noisy labels, and trains a deep neural network on the resulting noisy-labeled data.
Experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-03-08T11:46:02Z)
- Learning with Out-of-Distribution Data for Audio Classification [60.48251022280506]
We show that detecting and relabelling certain OOD instances, rather than discarding them, can have a positive effect on learning.
The proposed method is shown to improve the performance of convolutional neural networks by a significant margin.
arXiv Detail & Related papers (2020-02-11T21:08:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.