Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning
- URL: http://arxiv.org/abs/2510.03993v5
- Date: Mon, 03 Nov 2025 12:24:52 GMT
- Title: Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning
- Authors: Yaxin Hou, Bo Han, Yuheng Jia, Hui Liu, Junhui Hou
- Abstract summary: We propose a Controllable Pseudo-label Generation (CPG) framework to expand the labeled dataset with reliable pseudo-labels from the unlabeled dataset. CPG operates through a controllable self-reinforcing optimization cycle. CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97\%}$ in accuracy.
- Score: 88.48555005545694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution and that unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, which expands the labeled dataset with progressively identified reliable pseudo-labels from the unlabeled dataset and trains the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further prove theoretically that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97\%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.
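The cycle described in the abstract is easy to picture in code. Below is a minimal sketch, not the authors' implementation (their code is at the linked repository): a post-hoc logit adjustment built from the current labeled-set class prior, plus a per-class quota filter that keeps the updated labeled distribution known. The function names, the confidence threshold, and the post-hoc (rather than in-loss) form of the adjustment are all assumptions.

```python
import numpy as np

def logit_adjusted_probs(logits, class_prior, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) so the argmax
    approximates a Bayes-optimal classifier under the current label prior."""
    adjusted = logits - tau * np.log(class_prior + 1e-12)
    exp = np.exp(adjusted - adjusted.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def controllable_filter(probs, quota_per_class, thresh=0.95):
    """Admit at most quota_per_class[c] confident pseudo-labels per class so
    the updated labeled set keeps a known, controlled class distribution."""
    preds, conf = probs.argmax(axis=1), probs.max(axis=1)
    selected = []
    for c in range(probs.shape[1]):
        idx = np.where((preds == c) & (conf >= thresh))[0]
        idx = idx[np.argsort(-conf[idx])][:quota_per_class[c]]  # keep most confident
        selected.extend(idx.tolist())
    return np.asarray(selected, dtype=int)
```

At each step, the selected samples would move into the labeled set, the class prior would be recomputed from the new counts, and the refreshed adjustment would drive the next round of filtering.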
Related papers
- Source-Free Domain Adaptive Semantic Segmentation of Remote Sensing Images with Diffusion-Guided Label Enrichment [18.05142463882686]
Self-training has been widely used in SFDA, which requires obtaining as many high-quality pseudo-labels as possible to train models on target-domain data. We propose a novel pseudo-label optimization framework called Diffusion-Guided Label Enrichment (DGLE). It starts from a few easily obtained high-quality pseudo-labels and propagates them to a complete set of pseudo-labels while ensuring the quality of the newly generated labels.
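DGLE's diffusion guidance is specific to its segmentation setting; as a generic stand-in for the seed-then-propagate pattern it describes, here is a hedged k-nearest-neighbor sketch. The neighbor-agreement quality gate and all names are illustrative assumptions, not the paper's mechanism.

```python
import numpy as np

def propagate_from_seeds(feats, seed_idx, seed_labels, k=5, agree=0.8):
    """Expand a few trusted pseudo-labels to the rest of the set, accepting a
    new label only when the k nearest seeds largely agree on it."""
    labels = -np.ones(len(feats), dtype=int)
    labels[seed_idx] = seed_labels
    for i in np.where(labels < 0)[0]:
        d = np.linalg.norm(feats[seed_idx] - feats[i], axis=1)
        nn = seed_labels[np.argsort(d)[:k]]
        maj = np.bincount(nn).argmax()
        if (nn == maj).mean() >= agree:  # quality gate on newly created labels
            labels[i] = maj
    return labels  # -1 marks samples left unlabeled this round
```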
arXiv Detail & Related papers (2025-09-23T01:10:01Z)
- Improving realistic semi-supervised learning with doubly robust estimation [8.828699635463265]
A major challenge in Semi-Supervised Learning (SSL) is the limited information available about the class distribution in the unlabeled data. We propose to explicitly estimate the unlabeled class distribution, which is a finite-dimensional parameter, as an initial step, using a doubly robust estimator with a strong theoretical guarantee. This estimate can then be integrated into existing methods to pseudo-label the unlabeled data during training more accurately.
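For intuition, a doubly robust (AIPW-style) estimate of a class distribution can be sketched as follows. This is a generic illustration under a missing-completely-at-random labeling assumption; the paper's estimator and guarantees may differ in detail.

```python
import numpy as np

def dr_class_distribution(probs, is_labeled, onehot_labels, propensity):
    """probs: (n, K) model posteriors for ALL samples.
    is_labeled: (n,) bool; onehot_labels: (n, K), zeros where unlabeled.
    propensity: (n,) P(labeled | x), e.g. constant n_labeled / n under MCAR."""
    r = is_labeled.astype(float)[:, None]
    correction = r / propensity[:, None] * (onehot_labels - probs)
    pi = (probs + correction).mean(axis=0)  # outcome model + IPW correction
    pi = np.clip(pi, 0, None)
    return pi / pi.sum()                    # renormalize to a distribution
```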
arXiv Detail & Related papers (2025-02-01T02:34:12Z)
- Continuous Contrastive Learning for Long-Tailed Semi-Supervised Recognition [50.61991746981703]
Current state-of-the-art LTSSL approaches rely on high-quality pseudo-labels for large-scale unlabeled data.
This paper introduces a novel probabilistic framework that unifies various recent proposals in long-tail learning.
We introduce a continuous contrastive learning method, CCL, extending our framework to unlabeled data using reliable and smoothed pseudo-labels.
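A hedged sketch of the general idea of contrastive learning with smoothed pseudo-labels: positive-pair weights come from the agreement of soft label distributions rather than hard class equality. This is a simplification for illustration, not CCL's exact objective.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(z, soft_labels, temp=0.1):
    """z: (n, d) L2-normalized embeddings; soft_labels: (n, K) smoothed
    pseudo-label distributions (rows sum to 1)."""
    sim = z @ z.t() / temp
    off_diag = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    logp = F.log_softmax(sim.masked_fill(~off_diag, float('-inf')), dim=1)
    w = (soft_labels @ soft_labels.t()) * off_diag  # soft same-class weights
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-12)
    return -(w * logp.masked_fill(~off_diag, 0.0)).sum(dim=1).mean()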
arXiv Detail & Related papers (2024-10-08T15:06:10Z)
- All Points Matter: Entropy-Regularized Distribution Alignment for Weakly-supervised 3D Segmentation [67.30502812804271]
Pseudo-labels are widely employed in weakly supervised 3D segmentation tasks where only sparse ground-truth labels are available for learning.
We propose a novel learning strategy to regularize the generated pseudo-labels and effectively narrow the gaps between pseudo-labels and model predictions.
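One common way to realize such regularization is a KL alignment term between predictions and soft pseudo-labels plus an entropy term. The sketch below is an illustrative stand-in; the loss form and coefficient are assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_alignment(logits, soft_pseudo, lam=0.1):
    """KL(soft_pseudo || model) narrows the pseudo-label/prediction gap;
    the entropy term regularizes prediction confidence."""
    logp = F.log_softmax(logits, dim=1)
    p = logp.exp()
    align = F.kl_div(logp, soft_pseudo, reduction='batchmean')
    entropy = -(p * logp).sum(dim=1).mean()
    return align + lam * entropy
```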
arXiv Detail & Related papers (2023-05-25T08:19:31Z)
- Revisiting Self-Training with Regularized Pseudo-Labeling for Tabular Data [0.0]
We revisit self-training, which can be applied to any kind of algorithm, including gradient-boosted decision trees.
We propose a novel pseudo-labeling approach that regularizes the confidence scores based on the likelihoods of the pseudo-labels.
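As a rough illustration of likelihood-based confidence regularization, one could temper the model's confidence by a per-class density fitted on labeled data. The Gaussian fit and rescaling below are assumptions, not the paper's scheme.

```python
import numpy as np
from scipy.stats import multivariate_normal

def regularized_confidence(conf, feats, pseudo, lab_feats, lab_y, eps=1e-6):
    """Multiply model confidence by a (rescaled) class-conditional likelihood,
    assuming at least a few labeled samples per class for the Gaussian fit."""
    out = np.empty_like(conf)
    for c in np.unique(pseudo):
        x = lab_feats[lab_y == c]
        mvn = multivariate_normal(x.mean(0), np.cov(x.T) + eps * np.eye(x.shape[1]))
        m = pseudo == c
        lik = mvn.pdf(feats[m])
        out[m] = conf[m] * lik / (lik.max() + eps)  # rescale likelihoods to [0, 1]
    return out
```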
arXiv Detail & Related papers (2023-02-27T18:12:56Z)
- Dist-PU: Positive-Unlabeled Learning from a Label Distribution Perspective [89.5370481649529]
We propose a label distribution perspective for PU learning, pursuing consistency between the predicted and ground-truth label distributions.
Experiments on three benchmark datasets validate the effectiveness of the proposed method.
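The core consistency idea can be written in a few lines. This sketch only shows the distribution-matching term (the paper's full objective has additional components), and the L1 gap is an illustrative choice.

```python
import torch

def label_dist_consistency(probs_pos, pi):
    """probs_pos: (n,) predicted P(y=1|x) for a batch; pi: known positive-class
    prior. Penalize the gap between the batch's predicted positive rate and pi."""
    return (probs_pos.mean() - pi).abs()
```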
arXiv Detail & Related papers (2022-12-06T07:38:29Z)
- Re-distributing Biased Pseudo Labels for Semi-supervised Semantic Segmentation: A Baseline Investigation [30.688753736660725]
We present a simple and yet effective Distribution Alignment and Random Sampling (DARS) method to produce unbiased pseudo labels.
Our method performs favorably in comparison with state-of-the-art approaches.
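A hedged sketch of the alignment-plus-sampling recipe: reweight predicted class probabilities toward the labeled-set distribution, then randomly downsample classes that exceed their expected share. The reweighting form and quotas are assumptions, not the paper's exact procedure.

```python
import numpy as np

def align_and_sample(probs, target_dist, seed=0):
    """probs: (n, K) predicted class probabilities; target_dist: (K,) known
    labeled-set class distribution. Returns aligned probs and kept indices."""
    rng = np.random.default_rng(seed)
    current = probs.mean(axis=0)
    aligned = probs * (target_dist / (current + 1e-12))  # distribution alignment
    aligned /= aligned.sum(axis=1, keepdims=True)
    preds = aligned.argmax(axis=1)
    keep = []
    for c in range(probs.shape[1]):
        idx = np.where(preds == c)[0]
        if len(idx) == 0:
            continue
        quota = int(round(target_dist[c] * len(probs)))
        keep.extend(rng.choice(idx, size=min(quota, len(idx)), replace=False))
    return aligned, np.array(sorted(keep))
```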
arXiv Detail & Related papers (2021-07-23T14:45:14Z)
- In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach that does not rely on domain-specific data augmentations, but it performs relatively poorly in its original formulation.
We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process.
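The selection rule can be sketched as a joint confidence/uncertainty gate over stochastic forward passes (e.g., MC dropout). The thresholds and the standard-deviation uncertainty measure below are illustrative assumptions.

```python
import numpy as np

def ups_style_select(mc_probs, conf_thresh=0.9, unc_thresh=0.05):
    """mc_probs: (T, n, K) softmax outputs from T stochastic forward passes.
    Keep a pseudo-label only if mean confidence is high AND variance is low."""
    mean_p = mc_probs.mean(axis=0)
    preds = mean_p.argmax(axis=1)
    conf = mean_p.max(axis=1)
    # uncertainty = std of the winning class's probability across passes
    unc = mc_probs[:, np.arange(len(preds)), preds].std(axis=0)
    mask = (conf >= conf_thresh) & (unc <= unc_thresh)
    return np.where(mask)[0], preds[mask]
```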
arXiv Detail & Related papers (2021-01-15T23:29:57Z)
- Out-distribution aware Self-training in an Open World Setting [62.19882458285749]
We leverage unlabeled data in an open world setting to further improve prediction performance.
We introduce out-distribution aware self-training, which includes a careful sample selection strategy.
Our classifiers are by design out-distribution aware and can thus distinguish task-related inputs from unrelated ones.
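A minimal sketch of such a selection strategy, assuming a max-logit OOD score and illustrative thresholds (not the paper's exact scoring):

```python
import numpy as np

def ood_aware_select(logits, ood_thresh=-5.0, conf_thresh=0.95):
    """Pseudo-label a sample only if it looks in-distribution (low OOD score)
    and the classifier is confident; both thresholds are illustrative."""
    ood_score = -logits.max(axis=1)  # max-logit score: higher = more OOD-like
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    keep = (ood_score <= ood_thresh) & (probs.max(axis=1) >= conf_thresh)
    return np.where(keep)[0], probs.argmax(axis=1)[keep]
```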
arXiv Detail & Related papers (2020-12-21T12:25:04Z)