CAST: Cluster-Aware Self-Training for Tabular Data
- URL: http://arxiv.org/abs/2310.06380v2
- Date: Fri, 2 Feb 2024 17:31:05 GMT
- Title: CAST: Cluster-Aware Self-Training for Tabular Data
- Authors: Minwook Kim, Juseong Kim, Ki Beom Kim, Giltae Song
- Abstract summary: Self-training is vulnerable to noisy pseudo-labels caused by erroneous confidence.
Cluster-Aware Self-Training (CAST) enhances existing self-training algorithms at a negligible cost without significant modifications.
- Score: 0.5461938536945723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-training has gained traction because of its simplicity and
versatility, yet it is vulnerable to noisy pseudo-labels caused by erroneous
confidence. Several solutions have been proposed to handle the problem, but
they require significant modifications in self-training algorithms or model
architecture, and most have limited applicability in tabular domains. To
address this issue, we explore a novel direction of reliable confidence in
self-training contexts and conclude that the confidence, which represents the
value of the pseudo-label, should be aware of the cluster assumption. In this
regard, we propose Cluster-Aware Self-Training (CAST) for tabular data, which
enhances existing self-training algorithms at a negligible cost without
significant modifications. Concretely, CAST regularizes the confidence of the
classifier by leveraging local density for each class in the labeled training
data, forcing the pseudo-labels in low-density regions to have lower
confidence. Extensive empirical evaluations on up to 21 real-world datasets
confirm not only the superior performance of CAST but also its robustness in
various setups in self-training contexts.
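The mechanism the abstract describes, rescaling classifier confidence by the local density of each class in the labeled data so that pseudo-labels in low-density regions receive lower confidence, can be illustrated with a short sketch. This is a minimal sketch only: the per-class kernel density estimate, the multiplicative down-weighting, and all function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal, illustrative sketch of cluster-aware confidence regularization in the
# spirit of CAST. The per-class KDE, the multiplicative down-weighting, and all
# names below are assumptions for illustration, not the authors' implementation.
import numpy as np
from sklearn.neighbors import KernelDensity


def fit_class_densities(X_labeled, y_labeled, bandwidth=1.0):
    """Fit one kernel density estimator per class on the labeled training data."""
    return {c: KernelDensity(bandwidth=bandwidth).fit(X_labeled[y_labeled == c])
            for c in np.unique(y_labeled)}


def cluster_aware_confidence(clf, densities, X_unlabeled):
    """Return pseudo-labels and density-regularized confidences for unlabeled rows."""
    proba = clf.predict_proba(X_unlabeled)
    pseudo_idx = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    # Log-density of each sample under the KDE of its predicted class.
    log_dens = np.array([
        densities[clf.classes_[i]].score_samples(x[None, :])[0]
        for x, i in zip(X_unlabeled, pseudo_idx)
    ])
    # Map log-densities to (0, 1] so samples in low-density regions get lower confidence.
    dens = np.exp(log_dens - log_dens.max())
    return clf.classes_[pseudo_idx], conf * dens


# Hypothetical usage in one self-training round, assuming X_lab, y_lab, X_unlab
# are NumPy arrays of tabular features/labels and clf is an already fitted classifier:
#   densities = fit_class_densities(X_lab, y_lab)
#   pseudo, reg_conf = cluster_aware_confidence(clf, densities, X_unlab)
#   keep = reg_conf >= 0.8  # only confident, high-density pseudo-labels are added
```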
Related papers
- CALICO: Confident Active Learning with Integrated Calibration [11.978551396144532]
We propose an AL framework that self-calibrates the confidence used for sample selection during the training process.
We show improved classification performance compared to a softmax-based classifier with fewer labeled samples.
arXiv Detail & Related papers (2024-07-02T15:05:19Z)
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- Binary Classification with Confidence Difference [100.08818204756093]
This paper delves into a novel weakly supervised binary classification problem called confidence-difference (ConfDiff) classification.
We propose a risk-consistent approach to tackle this problem and show that the estimation error bound achieves the optimal convergence rate.
We also introduce a risk correction approach to mitigate overfitting problems, whose consistency and convergence rate are also proven.
arXiv Detail & Related papers (2023-10-09T11:44:50Z)
- Uncertainty-aware Self-training for Low-resource Neural Sequence Labeling [29.744621356187764]
This paper presents SeqUST, a novel uncertainty-aware self-training framework for neural sequence labeling (NSL).
We incorporate Monte Carlo (MC) dropout in a Bayesian neural network (BNN) to perform uncertainty estimation at the token level and then select reliable language tokens from the unlabeled data (a minimal sketch of this idea appears after this list).
A well-designed masked sequence labeling task with a noise-robust loss supports robust training and suppresses the effect of noisy pseudo-labels.
arXiv Detail & Related papers (2023-02-17T02:40:04Z)
- Cycle Self-Training for Domain Adaptation [85.14659717421533]
Cycle Self-Training (CST) is a principled self-training algorithm that enforces pseudo-labels to generalize across domains.
CST recovers target ground truth, while both invariant feature learning and vanilla self-training fail.
Empirical results indicate that CST significantly improves over the prior state of the art on standard UDA benchmarks.
arXiv Detail & Related papers (2021-03-05T10:04:25Z)
- Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
Self-Tuning is a novel approach to enable data-efficient deep learning.
It unifies the exploration of labeled and unlabeled data and the transfer of a pre-trained model.
It outperforms its SSL and TL counterparts on five tasks by sharp margins.
arXiv Detail & Related papers (2021-02-25T14:56:19Z)
- Two-phase Pseudo Label Densification for Self-training based Domain Adaptation [93.03265290594278]
We propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD.
In the first phase, we use sliding window voting to propagate confident predictions, utilizing the intrinsic spatial correlations in the images.
In the second phase, we perform a confidence-based easy-hard classification.
To ease the training process and avoid noisy predictions, we introduce the bootstrapping mechanism to the original self-training loss.
arXiv Detail & Related papers (2020-12-09T02:35:25Z)
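For the SeqUST entry above, the token-level uncertainty estimation it mentions can be sketched with standard Monte Carlo dropout: keep dropout active, run several stochastic forward passes, and score each token by its predictive entropy. The model interface, tensor shapes, and the entropy-based selection rule below are assumptions made for illustration, not the authors' code.

```python
# Minimal sketch of Monte Carlo dropout for token-level uncertainty, in the spirit
# of the SeqUST entry above. The model interface (a callable returning per-token
# logits) and the entropy-based scoring are illustrative assumptions.
import torch


def mc_dropout_token_uncertainty(model, input_ids, attention_mask, n_passes=10):
    """Estimate per-token predictive entropy by averaging stochastic forward passes."""
    model.train()  # keep dropout layers active at inference time
    probs = []
    with torch.no_grad():
        for _ in range(n_passes):
            logits = model(input_ids, attention_mask=attention_mask)  # (batch, seq, classes)
            probs.append(torch.softmax(logits, dim=-1))
    mean_probs = torch.stack(probs).mean(dim=0)
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=-1)  # (batch, seq)
    return mean_probs.argmax(dim=-1), entropy


# Tokens whose entropy falls below a chosen threshold would be treated as reliable
# pseudo-labels for the next self-training round; the rest are discarded or masked.
```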
This list is automatically generated from the titles and abstracts of the papers in this site.