Related papers: CleanPatrick: A Benchmark for Image Data Cleaning

URL: http://arxiv.org/abs/2505.11034v1
Date: Fri, 16 May 2025 09:29:41 GMT
Title: CleanPatrick: A Benchmark for Image Data Cleaning
Authors: Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro Gonzalez-Jimenez, Ludovic Amruthalingam, Elisabeth Victoria Goessinger, Hanna Lindemann, Marie Bargiela, Marie Hofbauer, Omar Badri, Philipp Tschandl, Arash Koochek, Matthew Groh, Alexander A. Navarini, Marc Pouly
Abstract summary: CleanPatrick is the first large-scale benchmark for data cleaning in the image domain. We collect 496,377 binary annotations from 933 medical crowd workers. We employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth.
Score: 31.45060372924389
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies, limiting comparison and real-world relevance. We introduce CleanPatrick, the first large-scale benchmark for data cleaning in the image domain, built upon the publicly available Fitzpatrick17k dermatology dataset. We collect 496,377 binary annotations from 933 medical crowd workers, identify off-topic samples (4%), near-duplicates (21%), and label errors (22%), and employ an aggregation model inspired by item-response theory followed by expert review to derive high-quality ground truth. CleanPatrick formalizes issue detection as a ranking task and adopts typical ranking metrics mirroring real audit workflows. Benchmarking classical anomaly detectors, perceptual hashing, SSIM, Confident Learning, NoiseRank, and SelfClean, we find that, on CleanPatrick, self-supervised representations excel at near-duplicate detection, classical methods achieve competitive off-topic detection under constrained review budgets, and label-error detection remains an open challenge for fine-grained medical classification. By releasing both the dataset and the evaluation framework, CleanPatrick enables a systematic comparison of image-cleaning strategies and paves the way for more reliable data-centric artificial intelligence.
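The abstract frames issue detection as a ranking task scored with typical ranking metrics. The sketch below illustrates that idea under stated assumptions: it is not the CleanPatrick evaluation code, the abstract does not name the exact metrics, and precision@k, recall@k, and average precision are used here only as common examples. All function names and the toy data are hypothetical.

```python
from typing import Sequence


def rank_by_score(scores: Sequence[float]) -> list[int]:
    """Return sample indices ordered from most to least suspicious."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def precision_at_k(ranking: Sequence[int], is_issue: Sequence[int], k: int) -> float:
    """Fraction of the top-k ranked samples that are true issues."""
    return sum(is_issue[i] for i in ranking[:k]) / k


def recall_at_k(ranking: Sequence[int], is_issue: Sequence[int], k: int) -> float:
    """Fraction of all true issues that appear in the top-k ranked samples."""
    total = sum(is_issue)
    return sum(is_issue[i] for i in ranking[:k]) / total if total else 0.0


def average_precision(ranking: Sequence[int], is_issue: Sequence[int]) -> float:
    """Mean of the precision values measured at the rank of each true issue."""
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(ranking, start=1):
        if is_issue[idx]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Hypothetical detector scores (higher = more suspicious) and ground truth.
    scores = [0.9, 0.1, 0.8, 0.3, 0.7]
    is_issue = [1, 0, 0, 0, 1]
    ranking = rank_by_score(scores)
    print(ranking)
    print(precision_at_k(ranking, is_issue, k=2))
    print(recall_at_k(ranking, is_issue, k=3))
    print(average_precision(ranking, is_issue))
```

Here `k` plays the role of a review budget: an auditor who can inspect only the top `k` samples cares how many of them are real issues (precision) and how many of the real issues were surfaced (recall).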

Related papers

When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification [11.49089004019603]
We present a comprehensive framework named REVEAL to address both noisy labels and missing labels in image classification test sets. REVEAL detects potential noisy labels and omissions, aggregates predictions from various methods, and refines label accuracy through confidence-informed predictions and consensus-based filtering. Our method effectively reveals missing labels from public datasets and provides soft-labeled results with likelihoods.
arXiv Detail & Related papers (2025-05-22T02:47:36Z)
SoftPatch+: Fully Unsupervised Anomaly Classification and Segmentation [84.07909405887696]
This paper is the first to consider fully unsupervised industrial anomaly detection (i.e., unsupervised AD with noisy data). We propose memory-based unsupervised AD methods, SoftPatch and SoftPatch+, which efficiently denoise the data at the patch level. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in coreset. Comprehensive experiments conducted in diverse noise scenarios demonstrate that both SoftPatch and SoftPatch+ outperform the state-of-the-art AD methods on the MVTecAD, ViSA, and BTAD benchmarks.
arXiv Detail & Related papers (2024-12-30T11:16:49Z)
Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks have been shown to be vulnerable to data poisoning attacks. Detecting poisoned samples in a mixed dataset is both beneficial and challenging. We propose an Iterative Filtering approach for identifying unlearnable examples (UEs).
arXiv Detail & Related papers (2024-08-15T13:26:13Z)
SoftPatch: Unsupervised Anomaly Detection with Noisy Data [67.38948127630644]
This paper considers label-level noise in image sensory anomaly detection for the first time. We propose a memory-based unsupervised AD method, SoftPatch, which efficiently denoises the data at the patch level. Compared with existing methods, SoftPatch maintains a strong modeling ability of normal data and alleviates the overconfidence problem in coreset.
arXiv Detail & Related papers (2024-03-21T08:49:34Z)
Towards Reliable Dermatology Evaluation Benchmarks [37.464923424849964]
Benchmark datasets for digital dermatology unwittingly contain inaccuracies that reduce trust in model performance estimates. We propose a resource-efficient data-cleaning protocol to identify issues that escaped previous curation.
arXiv Detail & Related papers (2023-09-13T13:54:32Z)
Intrinsic Self-Supervision for Data Quality Audits [35.69673085324971]
Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, or a scoring problem. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases.
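As a rough illustration of how a distance-based indicator can turn learned representations into a ranking of near-duplicate candidates, the sketch below sorts all image pairs by cosine distance. It is not the method from the paper: the embeddings are hypothetical placeholders, cosine distance is an assumed choice, and the function names are invented for this example.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance between two vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def rank_near_duplicate_pairs(
    embeddings: Sequence[Sequence[float]],
) -> list[tuple[float, int, int]]:
    """Return (distance, i, j) for every pair, closest pairs first."""
    pairs = [
        (cosine_distance(embeddings[i], embeddings[j]), i, j)
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    pairs.sort()  # the smallest distance is the strongest near-duplicate candidate
    return pairs


if __name__ == "__main__":
    # Hypothetical 2-D embeddings; real representations would come from an encoder.
    embeddings = [[1.0, 0.0], [0.0, 1.0], [0.99, 0.1]]
    for distance, i, j in rank_near_duplicate_pairs(embeddings):
        print(f"pair ({i}, {j}): distance {distance:.4f}")
```

Comparing all pairs is quadratic in the number of images, so this form only suits small collections; a larger audit would need an approximate nearest-neighbour search instead.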
arXiv Detail & Related papers (2023-05-26T15:57:04Z)
Class Prototype-based Cleaner for Label Noise Learning [73.007001454085]
Semi-supervised learning methods are current SOTA solutions to the noisy-label learning problem. We propose a simple yet effective solution, named Class Prototype-based label noise Cleaner.
arXiv Detail & Related papers (2022-12-21T04:56:41Z)
Active label cleaning: Improving dataset quality under resource constraints [13.716577886649018]
Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models. This work advocates for a data-driven approach to prioritising samples for re-annotation. We rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy.
arXiv Detail & Related papers (2021-09-01T19:03:57Z)
Improving Medical Image Classification with Label Noise Using Dual-uncertainty Estimation [72.0276067144762]
We discuss and define the two common types of label noise in medical images. We propose an uncertainty estimation-based framework to handle these two types of label noise in the medical image classification task.
arXiv Detail & Related papers (2021-02-28T14:56:45Z)
PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming [65.88506015656951]
We present PClean, a probabilistic programming language for leveraging dataset-specific knowledge to clean and normalize dirty data. PClean is powered by three modeling and inference contributions: (1) a non-parametric model of relational database instances, customizable via probabilistic programs, (2) a sequential Monte Carlo inference algorithm that exploits the model's structure, and (3) near-optimal SMC proposals and blocked Gibbs rejuvenation moves constructed on a per-dataset basis.
arXiv Detail & Related papers (2020-07-23T08:01:47Z)
