Detecting Mislabeled and Corrupted Data via Pointwise Mutual Information
- URL: http://arxiv.org/abs/2508.07713v1
- Date: Mon, 11 Aug 2025 07:39:20 GMT
- Title: Detecting Mislabeled and Corrupted Data via Pointwise Mutual Information
- Authors: Jinghan Yang, Jiayu Weng
- Abstract summary: This paper proposes a mutual information-based framework for data selection under hybrid noise scenarios. We compute each sample's pointwise contribution to the overall mutual information and find that lower contributions indicate noisy or mislabeled instances. Under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling.
- Score: 0.9821874476902969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks can memorize corrupted labels, making data quality critical for model performance, yet real-world datasets are frequently compromised by both label noise and input noise. This paper proposes a mutual information-based framework for data selection under hybrid noise scenarios that quantifies statistical dependencies between inputs and labels. We compute each sample's pointwise contribution to the overall mutual information and find that lower contributions indicate noisy or mislabeled instances. Empirical validation on MNIST with different synthetic noise settings demonstrates that the method effectively filters low-quality samples. Under label corruption, training on high-MI samples improves classification accuracy by up to 15% compared to random sampling. Furthermore, the method exhibits robustness to benign input modifications, preserving semantically valid data while filtering truly corrupted samples.
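The pointwise scoring idea in the abstract can be illustrated with a minimal sketch. It assumes a plug-in estimate in which a classifier's predicted probability stands in for p(y|x) and the empirical label frequency stands in for p(y), so each sample's score is log p(y_i|x_i) - log p(y_i). The abstract does not specify the paper's actual estimator, so this may differ from it; the function names `pointwise_mi` and `select_high_mi` and the `keep_fraction` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def pointwise_mi(probs, labels, eps=1e-12):
    """Per-sample pointwise MI estimate: log p(y_i | x_i) - log p(y_i).

    probs  : (n, k) array of predicted class probabilities (assumed to
             come from some classifier; this is an illustrative choice).
    labels : (n,) array of the (possibly noisy) assigned labels.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n, k = probs.shape
    # Empirical label marginal p(y), estimated from the given labels.
    marginal = np.bincount(labels, minlength=k) / n
    # Predicted probability of the label each sample was actually given.
    cond = probs[np.arange(n), labels]
    return np.log(cond + eps) - np.log(marginal[labels] + eps)

def select_high_mi(scores, keep_fraction=0.8):
    """Return sorted indices of the highest-scoring fraction of samples."""
    n_keep = int(round(keep_fraction * len(scores)))
    order = np.argsort(scores)[::-1]  # highest score first
    return np.sort(order[:n_keep])

# Toy data: the last sample is labeled 0 although the model favors class 1.
probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],
    [0.1, 0.9],
])
labels = np.array([0, 0, 1, 0])

scores = pointwise_mi(probs, labels)
kept = select_high_mi(scores, keep_fraction=0.75)
print(scores)         # the suspicious last sample gets the lowest score
print(scores.mean())  # the average is a plug-in estimate of I(X; Y)
print(kept)
```

In this toy case the sample whose label disagrees with the model receives a negative score and is the one dropped. In practice the probabilities should come from a model that has not already memorized the noisy labels, for example via held-out or cross-validated predictions.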
Related papers
- Combating Noisy Labels through Fostering Self- and Neighbor-Consistency [120.4394402099635]
Label noise is pervasive in various real-world scenarios, posing challenges in supervised deep learning. We propose a noise-robust method named Jo-SNC (Joint sample selection and model regularization based on Self- and Neighbor-Consistency). We design a self-adaptive, data-driven thresholding scheme to adjust per-class selection thresholds.
arXiv Detail & Related papers (2026-01-19T07:55:29Z)
- Benchmarking noisy label detection methods [0.3154269505086154]
Label noise is a common problem in real-world datasets, affecting both model training and validation. We perform a comprehensive benchmark of detection methods by decomposing them into three fundamental components. We identify that in-sample information gathering using average probability aggregation combined with the logit margin achieves the best results.
arXiv Detail & Related papers (2025-10-17T20:55:26Z)
- Extracting Clean and Balanced Subset for Noisy Long-tailed Classification [66.47809135771698]
We develop a novel pseudo labeling method using class prototypes from the perspective of distribution matching.
By setting a manually specified probability measure, we can reduce the side effects of noisy and long-tailed data simultaneously.
Our method can extract this class-balanced subset with clean labels, which brings effective performance gains for long-tailed classification with label noise.
arXiv Detail & Related papers (2024-04-10T07:34:37Z)
- Learning with Imbalanced Noisy Data by Preventing Bias in Sample Selection [82.43311784594384]
Real-world datasets contain not only noisy labels but also class imbalance.
We propose a simple yet effective method to address noisy labels in imbalanced datasets.
arXiv Detail & Related papers (2024-02-17T10:34:53Z)
- Combating Label Noise With A General Surrogate Model For Sample Selection [77.45468386115306]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically. We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
arXiv Detail & Related papers (2023-10-16T14:43:27Z)
- Differences Between Hard and Noisy-labeled Samples: An Empirical Study [7.132368785057315]
Extracting noisy or incorrectly labeled samples from a labeled dataset with hard/difficult samples is an important yet under-explored topic.
We introduce a simple yet effective metric that filters out noisy-labeled samples while keeping the hard samples.
Our proposed data partitioning method significantly outperforms other methods when employed within a semi-supervised learning framework.
arXiv Detail & Related papers (2023-07-20T09:24:23Z)
- PASS: Peer-Agreement based Sample Selection for training with Noisy Labels [16.283722126438125]
The prevalence of noisy-label samples poses a significant challenge in deep learning, inducing overfitting effects.
Current methodologies often rely on the small-loss hypothesis or feature-based selection to separate noisy- and clean-label samples.
We propose a new noisy-label detection method, termed Peer-Agreement based Sample Selection (PASS), to address this problem.
arXiv Detail & Related papers (2023-03-20T00:35:33Z)
- Learning from Noisy Labels with Coarse-to-Fine Sample Credibility Modeling [22.62790706276081]
Training deep neural networks (DNNs) with noisy labels is practically challenging.
Previous efforts tend to handle part or full data in a unified denoising flow.
We propose a coarse-to-fine robust learning method called CREMA to handle noisy data in a divide-and-conquer manner.
arXiv Detail & Related papers (2022-08-23T02:06:38Z)
- S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise.
At the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of the labels in its neighborhood in the feature space.
Our method significantly surpasses previous methods on both CIFAR10 and CIFAR100 with artificial noise and real-world noisy datasets such as WebVision and ANIMAL-10N.
arXiv Detail & Related papers (2021-11-22T15:49:20Z)
- Assessing the Quality of the Datasets by Identifying Mislabeled Samples [14.881597737762316]
We propose a novel statistic -- noise score -- as a measure for the quality of each data point to identify mislabeled samples.
In our work, we use the representations derived by the inference network of a data quality supervised variational autoencoder (AQUAVS).
We validate our proposed statistic through experimentation by corrupting MNIST, FashionMNIST, and CIFAR10/100 datasets.
arXiv Detail & Related papers (2021-09-10T17:14:09Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)