On Pseudo-Labeling for Class-Mismatch Semi-Supervised Learning
- URL: http://arxiv.org/abs/2301.06010v1
- Date: Sun, 15 Jan 2023 03:21:59 GMT
- Title: On Pseudo-Labeling for Class-Mismatch Semi-Supervised Learning
- Authors: Lu Han, Han-Jia Ye, De-Chuan Zhan
- Abstract summary: In this paper, we empirically analyze Pseudo-Labeling (PL) in class-mismatched SSL.
PL is a simple and representative SSL method that transforms SSL problems into supervised learning by creating pseudo-labels for unlabeled data.
We propose to improve PL in class-mismatched SSL with two components -- Re-balanced Pseudo-Labeling (RPL) and Semantic Exploration Clustering (SEC).
- Score: 50.48888534815361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When there are unlabeled Out-Of-Distribution (OOD) data from other classes,
Semi-Supervised Learning (SSL) methods suffer severe performance
degradation and can even perform worse than training on labeled data alone. In this
paper, we empirically analyze Pseudo-Labeling (PL) in class-mismatched SSL. PL
is a simple and representative SSL method that transforms SSL problems into
supervised learning by creating pseudo-labels for unlabeled data according to
the model's prediction. We aim to answer two main questions: (1) How do OOD
data influence PL? (2) What is the proper usage of OOD data with PL? First, we
show that the major problem of PL is imbalanced pseudo-labels on OOD data.
Second, we find that OOD data can help classify In-Distribution (ID) data given
their OOD ground truth labels. Based on the findings, we propose to improve PL
in class-mismatched SSL with two components -- Re-balanced Pseudo-Labeling
(RPL) and Semantic Exploration Clustering (SEC). RPL re-balances pseudo-labels
of high-confidence data, which simultaneously filters out OOD data and
addresses the imbalance problem. SEC uses balanced clustering on low-confidence
data to create pseudo-labels on extra classes, simulating the process of
training with ground truth. Experiments show that our method achieves steady
improvements over the supervised baseline and state-of-the-art performance under
all class-mismatch ratios on different benchmarks.
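The abstract sketches three moving parts: vanilla pseudo-labeling (confidence-thresholded argmax labels), RPL's per-class re-balancing of the high-confidence pool, and SEC's balanced clustering of the low-confidence remainder into extra classes. The following is a minimal NumPy sketch of how those two proposed components could fit together, reconstructed only from the abstract; the threshold value, the greedy balanced-assignment heuristic, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rebalanced_pseudo_labels(probs, tau=0.95):
    """Re-balanced Pseudo-Labeling (RPL), sketched: among high-confidence
    predictions, keep the same number of pseudo-labels per ID class, so no
    single class (e.g. one flooded by OOD data) can dominate training."""
    conf, pred = probs.max(axis=1), probs.argmax(axis=1)
    high = np.where(conf >= tau)[0]                      # high-confidence pool
    per_class = [high[pred[high] == c] for c in range(probs.shape[1])]
    n_keep = min(len(idx) for idx in per_class)          # size of rarest class
    kept = []
    for idx in per_class:                                # most confident first
        kept.extend(idx[np.argsort(-conf[idx])][:n_keep])
    kept = np.array(sorted(kept), dtype=int)
    low = np.setdiff1d(np.arange(len(probs)), kept)      # remainder goes to SEC
    return kept, pred[kept], low

def semantic_exploration_clustering(features, low, k_extra, n_id_classes):
    """Semantic Exploration Clustering (SEC), sketched: balanced k-means-style
    assignment of low-confidence (likely OOD) data to k_extra extra classes,
    simulating training with OOD ground-truth labels."""
    rng = np.random.default_rng(0)
    x = features[low]
    centers = x[rng.choice(len(x), k_extra, replace=False)]
    for _ in range(10):                                  # a few Lloyd rounds
        d = ((x[:, None] - centers[None]) ** 2).sum(-1)  # (n, k) sq. distances
        cap = int(np.ceil(len(x) / k_extra))             # equal-size clusters
        assign = np.full(len(x), -1)
        counts = np.zeros(k_extra, dtype=int)
        for i in np.argsort(d.min(axis=1)):              # easiest points first
            for c in np.argsort(d[i]):                   # nearest center w/ room
                if counts[c] < cap:
                    assign[i], counts[c] = c, counts[c] + 1
                    break
        centers = np.stack([x[assign == c].mean(0) if (assign == c).any()
                            else centers[c] for c in range(k_extra)])
    return low, n_id_classes + assign                    # labels on extra classes

# Toy usage: 200 unlabeled points, 4 ID classes, 2 assumed extra OOD clusters.
probs = np.random.default_rng(1).dirichlet(np.ones(4), size=200)
feats = np.random.default_rng(2).normal(size=(200, 16))
kept, pl_id, low = rebalanced_pseudo_labels(probs, tau=0.5)
ood_idx, pl_ood = semantic_exploration_clustering(feats, low, 2, 4)
```

In the paper's pipeline the re-balanced ID pseudo-labels and the extra-class cluster labels would then both serve as supervised targets for the next training round; here the functions simply return them.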
Related papers
- A Channel-ensemble Approach: Unbiased and Low-variance Pseudo-labels is Critical for Semi-supervised Classification [61.473485511491795]
Semi-supervised learning (SSL) is a practical challenge in computer vision.
Pseudo-label (PL) methods, e.g., FixMatch and FreeMatch, obtain state-of-the-art (SOTA) performance in SSL.
We propose a lightweight channel-based ensemble method to consolidate multiple inferior PLs into the theoretically guaranteed unbiased and low-variance one.
arXiv Detail & Related papers (2024-03-27T09:49:37Z)
- FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness for Semi-Supervised Learning [73.13448439554497]
Semi-Supervised Learning (SSL) has been an effective way to leverage abundant unlabeled data with extremely scarce labeled data.
Most SSL methods are commonly based on instance-wise consistency between different data transformations.
We propose FlatMatch which minimizes a cross-sharpness measure to ensure consistent learning performance between the two datasets.
arXiv Detail & Related papers (2023-10-25T06:57:59Z)
- An Embarrassingly Simple Baseline for Imbalanced Semi-Supervised Learning [103.65758569417702]
Semi-supervised learning (SSL) has shown great promise in leveraging unlabeled data to improve model performance.
We consider a more realistic and challenging setting called imbalanced SSL, where imbalanced class distributions occur in both labeled and unlabeled data.
We study a simple yet overlooked baseline -- SimiS -- which tackles data imbalance by simply supplementing labeled data with pseudo-labels.
arXiv Detail & Related papers (2022-11-20T21:18:41Z)
- BASIL: Balanced Active Semi-supervised Learning for Class Imbalanced Datasets [14.739359755029353]
Current semi-supervised learning (SSL) methods assume a balance between the number of data points available for each class in both the labeled and the unlabeled data sets.
We propose BASIL, a novel algorithm that optimizes the submodular mutual information (SMI) functions in a per-class fashion to gradually select a balanced dataset in an active learning loop.
arXiv Detail & Related papers (2022-03-10T21:34:08Z)
- On The Consistency Training for Open-Set Semi-Supervised Learning [44.046578996049654]
We study how OOD samples affect training in both low- and high-dimensional spaces.
Our method makes better use of OOD samples and achieves state-of-the-art results.
arXiv Detail & Related papers (2021-01-19T12:38:17Z)
- In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach that does not rely on domain-specific data augmentations, but it performs relatively poorly in its original formulation.
We argue that PL underperforms due to erroneous high-confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo-labeling accuracy by drastically reducing the amount of noise encountered in the training process (a hedged sketch of this selection idea appears after this list).
arXiv Detail & Related papers (2021-01-15T23:29:57Z)
- Distribution Aligning Refinery of Pseudo-label for Imbalanced Semi-supervised Learning [126.31716228319902]
We develop Distribution Aligning Refinery of Pseudo-label (DARP) algorithm.
We show that DARP is provably and efficiently compatible with state-of-the-art SSL schemes.
arXiv Detail & Related papers (2020-07-17T09:16:05Z)
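The UPS entry above attributes PL's weakness to confidently wrong predictions, which suggests gating pseudo-labels on uncertainty as well as confidence. Below is a hedged NumPy sketch of that selection idea using agreement across stochastic forward passes (e.g. MC dropout); the thresholds and the std-based uncertainty measure are assumptions for illustration, not the UPS paper's exact criteria.

```python
import numpy as np

def uncertainty_aware_selection(mc_probs, conf_thresh=0.9, unc_thresh=0.05):
    """Accept a pseudo-label only when the mean prediction is confident AND
    stable across T stochastic forward passes; mc_probs has shape (T, N, K)."""
    mean_probs = mc_probs.mean(axis=0)                 # (N, K) mean prediction
    pred = mean_probs.argmax(axis=1)
    conf = mean_probs.max(axis=1)
    # uncertainty: spread of the predicted class's probability across passes
    unc = mc_probs[:, np.arange(len(pred)), pred].std(axis=0)
    keep = (conf >= conf_thresh) & (unc <= unc_thresh)
    return np.where(keep)[0], pred[keep]

# Toy usage: 10 stochastic passes over 100 unlabeled points, 4 classes.
mc = np.random.default_rng(0).dirichlet(np.ones(4), size=(10, 100))
idx, labels = uncertainty_aware_selection(mc, conf_thresh=0.5, unc_thresh=0.1)
```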