CAST: Cluster-Aware Self-Training for Tabular Data
- URL: http://arxiv.org/abs/2310.06380v2
- Date: Fri, 2 Feb 2024 17:31:05 GMT
- Title: CAST: Cluster-Aware Self-Training for Tabular Data
- Authors: Minwook Kim, Juseong Kim, Ki Beom Kim, Giltae Song
- Abstract summary: Self-training is vulnerable to noisy pseudo-labels caused by erroneous confidence.
Cluster-Aware Self-Training (CAST) enhances existing self-training algorithms at a negligible cost without significant modifications.
- Score: 0.5461938536945723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-training has gained traction because of its simplicity and
versatility, yet it is vulnerable to noisy pseudo-labels caused by erroneous
confidence. Several solutions have been proposed to handle the problem, but
they require significant modifications in self-training algorithms or model
architecture, and most have limited applicability in tabular domains. To
address this issue, we explore a novel direction of reliable confidence in
self-training contexts and conclude that the confidence, which represents the
value of the pseudo-label, should be aware of the cluster assumption. In this
regard, we propose Cluster-Aware Self-Training (CAST) for tabular data, which
enhances existing self-training algorithms at a negligible cost without
significant modifications. Concretely, CAST regularizes the confidence of the
classifier by leveraging local density for each class in the labeled training
data, forcing the pseudo-labels in low-density regions to have lower
confidence. Extensive empirical evaluations on up to 21 real-world datasets
confirm not only the superior performance of CAST but also its robustness in
various setups in self-training contexts.
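The idea in the abstract, lowering the confidence of pseudo-labels that fall in low-density regions of the labeled data, can be illustrated with a minimal sketch. This is not the authors' implementation: the diagonal Gaussian density estimate, the multiplicative scaling, and every function name below (`fit_class_gaussians`, `relative_density`, `regularize_confidence`, `select_pseudo_labels`) are assumptions chosen for illustration only.

```python
import numpy as np


def fit_class_gaussians(X, y):
    """Fit one diagonal Gaussian per class on the labeled data (assumed density model)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + 1e-6  # small floor keeps variances positive
        params[c] = (mean, var)
    return params


def relative_density(x, mean, var):
    """Density of x relative to the class peak, a value in (0, 1]."""
    return float(np.exp(-0.5 * np.sum((x - mean) ** 2 / var)))


def regularize_confidence(probs, X_unlabeled, params, classes):
    """Scale each class probability by the relative density of that class at x."""
    out = np.empty_like(probs, dtype=float)
    for i, x in enumerate(X_unlabeled):
        for j, c in enumerate(classes):
            mean, var = params[c]
            out[i, j] = probs[i, j] * relative_density(x, mean, var)
    return out


def select_pseudo_labels(reg_probs, classes, threshold=0.5):
    """Keep samples whose regularized top confidence clears the threshold."""
    top = reg_probs.argmax(axis=1)
    keep = reg_probs.max(axis=1) >= threshold
    return keep, np.asarray(classes)[top]


# Toy labeled data: two well-separated clusters, one per class.
X_labeled = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2],
                      [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_labeled = np.array([0, 0, 0, 1, 1, 1])

# One unlabeled point inside cluster 0, one in the empty region between clusters.
X_unlabeled = np.array([[0.03, 0.1], [2.5, 2.5]])
# Hypothetical classifier output: equally (over)confident on both points.
probs = np.array([[0.95, 0.05], [0.95, 0.05]])

params = fit_class_gaussians(X_labeled, y_labeled)
classes = sorted(params)
reg_probs = regularize_confidence(probs, X_unlabeled, params, classes)
keep, labels = select_pseudo_labels(reg_probs, classes, threshold=0.5)
```

In this toy setup the point between the clusters keeps its raw confidence of 0.95 under plain thresholding, but its regularized confidence collapses toward zero, so it is excluded from the pseudo-labeled set. The base classifier and the self-training loop are left untouched, which is the sense in which such a regularizer can be added without significant modifications.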
Related papers
- Feedback-Driven Pseudo-Label Reliability Assessment: Redefining Thresholding for Semi-Supervised Semantic Segmentation [5.7977777220041204]
A common practice in pseudo-supervision is filtering pseudo-labels based on pre-defined confidence thresholds or entropy.
We propose Ensemble-of-Confidence Reinforcement (ENCORE), a dynamic feedback-driven thresholding strategy for pseudo-label selection.
Our method seamlessly integrates into existing pseudo-supervision frameworks and significantly improves segmentation performance.
arXiv Detail & Related papers (2025-05-12T15:58:08Z)
- CALICO: Confident Active Learning with Integrated Calibration [11.978551396144532]
We propose an AL framework that self-calibrates the confidence used for sample selection during the training process.
We show improved classification performance compared to a softmax-based classifier with fewer labeled samples.
arXiv Detail & Related papers (2024-07-02T15:05:19Z)
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z)
- A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition [74.79785063365289]
Existing models for named entity recognition (NER) are mainly based on large-scale labeled datasets.
We propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER.
arXiv Detail & Related papers (2023-05-21T15:31:23Z)
- Revisiting Self-Training with Regularized Pseudo-Labeling for Tabular Data [0.0]
We revisit self-training, which can be applied to any kind of algorithm, including gradient boosting decision trees.
We propose a novel pseudo-labeling approach that regularizes the confidence scores based on the likelihoods of the pseudo-labels.
arXiv Detail & Related papers (2023-02-27T18:12:56Z)
- Uncertainty-aware Self-training for Low-resource Neural Sequence Labeling [29.744621356187764]
This paper presents SeqUST, a novel uncertainty-aware self-training framework for neural sequence labeling (NSL).
We incorporate Monte Carlo (MC) dropout in Bayesian neural network (BNN) to perform uncertainty estimation at the token level and then select reliable language tokens from unlabeled data.
A well-designed masked sequence labeling task with a noise-robust loss supports robust training, which aims to suppress the problem of noisy pseudo labels.
arXiv Detail & Related papers (2023-02-17T02:40:04Z)
- Confident Sinkhorn Allocation for Pseudo-Labeling [40.883130133661304]
Semi-supervised learning is a critical tool in reducing machine learning's dependence on labeled data.
This paper theoretically studies the role of uncertainty in pseudo-labeling and proposes Confident Sinkhorn Allocation (CSA).
CSA identifies the best pseudo-label allocation via optimal transport to only samples with high confidence scores.
arXiv Detail & Related papers (2022-06-13T02:16:26Z)
- Semi-Supervised Learning of Semantic Correspondence with Pseudo-Labels [26.542718087103665]
SemiMatch is a semi-supervised solution for establishing dense correspondences across semantically similar images.
Our framework generates the pseudo-labels using the model's prediction itself between source and weakly-augmented target, and uses pseudo-labels to learn the model again between source and strongly-augmented target.
In experiments, SemiMatch achieves state-of-the-art performance on various benchmarks, especially on PF-Willow by a large margin.
arXiv Detail & Related papers (2022-03-30T03:52:50Z)
- Cycle Self-Training for Domain Adaptation [85.14659717421533]
Cycle Self-Training (CST) is a principled self-training algorithm that enforces pseudo-labels to generalize across domains.
CST recovers target ground truth, while both invariant feature learning and vanilla self-training fail.
Empirical results indicate that CST significantly improves over the prior state of the art on standard UDA benchmarks.
arXiv Detail & Related papers (2021-03-05T10:04:25Z)
- Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
Self-Tuning is a novel approach to enable data-efficient deep learning.
It unifies the exploration of labeled and unlabeled data and the transfer of a pre-trained model.
It outperforms its SSL and TL counterparts on five tasks by sharp margins.
arXiv Detail & Related papers (2021-02-25T14:56:19Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning assumes the incoming data are fully labeled, which might not be applicable in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.