Improved Naive Bayes with Mislabeled Data
- URL: http://arxiv.org/abs/2304.06292v1
- Date: Thu, 13 Apr 2023 06:52:07 GMT
- Title: Improved Naive Bayes with Mislabeled Data
- Authors: Qianhan Zeng, Yingqiu Zhu, Xuening Zhu, Feifei Wang, Weichen Zhao,
Shuning Sun, Meng Su, Hansheng Wang
- Abstract summary: We propose an improved Naive Bayes method for text classification.
It is analytically simple and free of subjective judgements on the correct and incorrect labels.
Our simulation and experiment results show that the improved method greatly improves the performance of the standard Naive Bayes method with mislabeled data.
- Score: 0.48372723204747653
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Labeling mistakes are frequently encountered in real-world applications. If
not handled properly, labeling mistakes can seriously degrade the classification
performance of a model. To address this issue, we propose an improved Naive Bayes
method for text classification. It is analytically simple and free of subjective
judgements on the correct and incorrect labels. By specifying the generating
mechanism of incorrect labels, we optimize the corresponding log-likelihood
function iteratively using an EM algorithm. Our simulation and experiment results
show that the improved Naive Bayes method greatly improves the performance of the
standard Naive Bayes method with mislabeled data.
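The abstract's recipe (latent true labels, a specified generating mechanism for incorrect labels, iterative maximisation of the log-likelihood via EM) can be illustrated with a minimal sketch. This is not the authors' implementation: the multinomial bag-of-words model, the binary-class restriction, the symmetric per-class flip rates, and all function names are assumptions made for illustration.

```python
import numpy as np

def em_naive_bayes(X, y_obs, n_iter=50, flip_init=0.1):
    """EM for multinomial Naive Bayes when observed labels may be flipped.

    X: (n_docs, n_words) word-count matrix; y_obs: observed (possibly wrong)
    0/1 labels. Returns class priors, word probabilities, estimated flip
    rates, and per-document posteriors over the latent true label.
    """
    n, d = X.shape
    # Initialise parameters from the observed labels.
    prior = np.array([np.mean(y_obs == 0), np.mean(y_obs == 1)])
    theta = np.ones((2, d))
    for c in (0, 1):
        theta[c] = X[y_obs == c].sum(axis=0) + 1.0  # Laplace smoothing
        theta[c] /= theta[c].sum()
    flip = np.array([flip_init, flip_init])  # P(y_obs != c | true label = c)

    for _ in range(n_iter):
        # E-step: responsibilities r[i, c] = P(true label = c | x_i, y_obs_i).
        log_lik = X @ np.log(theta).T + np.log(prior)       # (n, 2)
        for c in (0, 1):
            match = (y_obs == c)
            # Likelihood of the observed label under the flip mechanism.
            log_lik[:, c] += np.where(match, np.log(1 - flip[c]), np.log(flip[c]))
        log_lik -= log_lik.max(axis=1, keepdims=True)       # numerical stability
        r = np.exp(log_lik)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate priors, word probabilities, and flip rates.
        prior = r.mean(axis=0)
        for c in (0, 1):
            theta[c] = r[:, c] @ X + 1.0
            theta[c] /= theta[c].sum()
            flip[c] = (r[:, c] * (y_obs != c)).sum() / max(r[:, c].sum(), 1e-12)
            flip[c] = np.clip(flip[c], 1e-6, 0.49)
    return prior, theta, flip, r
```

With well-separated classes the E-step posterior typically recovers the uncorrupted labels, so the final responsibilities can be used in place of the noisy labels when refitting a downstream classifier.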
Related papers
- Data-Driven Estimation of the False Positive Rate of the Bayes Binary Classifier via Soft Labels [25.40796153743837]
We propose an estimator for the false positive rate (FPR) of the Bayes classifier, that is, the optimal classifier with respect to accuracy, from a given dataset.
We develop effective FPR estimators by leveraging a denoising technique and the Nadaraya-Watson estimator.
arXiv Detail & Related papers (2024-01-27T20:41:55Z)
- Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition [49.42732949233184]
When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition.
Taking noisy labels as ground-truth in the loss function results in suboptimal performance.
We propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels.
arXiv Detail & Related papers (2023-08-12T12:13:52Z)
- SoftMatch: Addressing the Quantity-Quality Trade-off in Semi-supervised Learning [101.86916775218403]
This paper revisits the popular pseudo-labeling methods via a unified sample weighting formulation.
We propose SoftMatch to overcome the trade-off by maintaining both high quantity and high quality of pseudo-labels during training.
In experiments, SoftMatch shows substantial improvements across a wide variety of benchmarks, including image, text, and imbalanced classification.
arXiv Detail & Related papers (2023-01-26T03:53:25Z)
- Filter and evolve: progressive pseudo label refining for semi-supervised automatic speech recognition [5.735000563764309]
Low quality pseudo labels can misguide decision boundaries and degrade performance.
We propose a simple yet effective strategy to filter low quality pseudo labels.
Experiments on LibriSpeech show that these filtered samples enable the refined model to yield more correct predictions.
arXiv Detail & Related papers (2022-10-28T16:15:58Z)
- Active Learning by Feature Mixing [52.16150629234465]
We propose a novel method for batch active learning called ALFA-Mix.
We identify unlabelled instances with sufficiently-distinct features by seeking inconsistencies in predictions.
We show that inconsistencies in these predictions help discover features that the model is unable to recognise in the unlabelled instances.
arXiv Detail & Related papers (2022-03-14T12:20:54Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- Learning with Noisy Labels by Efficient Transition Matrix Estimation to Combat Label Miscorrection [3.48062110627933]
Recent studies on learning with noisy labels have shown remarkable performance by exploiting a small clean dataset.
Meta-learning-based label correction methods further improve performance by correcting noisy labels on the fly.
However, there is no safeguard on the label miscorrection, resulting in unavoidable performance degradation.
We propose a robust and efficient method that learns a label transition matrix on the fly.
arXiv Detail & Related papers (2021-11-29T20:12:17Z)
- A Novel Perspective for Positive-Unlabeled Learning via Noisy Labels [49.990938653249415]
This research presents a methodology that assigns initial pseudo-labels to unlabeled data, treats them as noisy labels, and trains a deep neural network on the resulting noisy-labeled data.
Experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-03-08T11:46:02Z)
- In Defense of Pseudo-Labeling: An Uncertainty-Aware Pseudo-label Selection Framework for Semi-Supervised Learning [53.1047775185362]
Pseudo-labeling (PL) is a general SSL approach that does not rely on domain-specific data augmentations but performs relatively poorly in its original formulation.
We argue that PL underperforms due to the erroneous high confidence predictions from poorly calibrated models.
We propose an uncertainty-aware pseudo-label selection (UPS) framework which improves pseudo labeling accuracy by drastically reducing the amount of noise encountered in the training process.
arXiv Detail & Related papers (2021-01-15T23:29:57Z)
- Improving Generalization of Deep Fault Detection Models in the Presence of Mislabeled Data [1.3535770763481902]
We propose a novel two-step framework for robust training with label noise.
In the first step, we identify outliers (including the mislabeled samples) based on the update in the hypothesis space.
In the second step, we propose different approaches to modifying the training data based on the identified outliers and a data augmentation technique.
arXiv Detail & Related papers (2020-09-30T12:33:25Z)
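Several entries above (the LibriSpeech pseudo-label filtering paper, the UPS selection framework, the two-step outlier-identification framework) share one primitive: score each training example under the current model and drop the ones whose observed label looks least plausible before retraining. A minimal, hypothetical sketch of that filtering step is below; the function name, the nearest-centroid margin used as the loss proxy, and the binary-label restriction are all illustrative assumptions, not any listed paper's actual method.

```python
import numpy as np

def filter_suspect_labels(X, y_obs, drop_frac=0.2):
    """Loss-based filtering of likely-mislabeled samples.

    Fits per-class centroids on the noisy labels, then scores each sample by
    the margin between its squared distance to the observed-label centroid
    and its squared distance to the other centroid. Samples whose observed
    label looks least plausible (largest margin) are dropped.
    Returns a boolean mask of samples to keep for retraining.
    """
    mu = np.array([X[y_obs == c].mean(axis=0) for c in (0, 1)])
    losses = np.empty(len(y_obs))
    for i, (x, y) in enumerate(zip(X, y_obs)):
        # Positive margin: x sits closer to the *other* class's centroid,
        # which suggests the observed label may be wrong.
        losses[i] = np.sum((x - mu[y]) ** 2) - np.sum((x - mu[1 - y]) ** 2)
    keep = losses <= np.quantile(losses, 1 - drop_frac)
    return keep
```

In a filter-and-refit loop, the model would then be retrained on `X[keep], y_obs[keep]` and the scoring repeated, in the spirit of the progressive refining strategies cited above.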
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.