Related papers: Robust Data Pruning: Uncovering and Overcoming Implicit Bias

Robust Data Pruning: Uncovering and Overcoming Implicit Bias

URL: http://arxiv.org/abs/2404.05579v1
Date: Mon, 8 Apr 2024 14:55:35 GMT
Title: Robust Data Pruning: Uncovering and Overcoming Implicit Bias
Authors: Artem Vysogorets, Kartik Ahuja, Julia Kempe,
Abstract summary: We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. We propose a "fairness-aware" approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks.
Score: 11.930434318557156
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. At the same time, we argue that random data pruning with appropriate class ratios has potential to improve the worst-class performance. We propose a "fairness-aware" approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues improving robustness at a tolerable drop of average performance as we prune more from the datasets. We present theoretical analysis of the classification risk in a mixture of Gaussians to further motivate our algorithm and support our findings.

Related papers

Improving Model Classification by Optimizing the Training Dataset [3.987352341101438]
Coresets offer a principled approach to data reduction, enabling efficient learning on large datasets.<n>We present a systematic framework for tuning the coreset generation process to enhance downstream classification quality.
arXiv Detail & Related papers (2025-07-22T16:10:11Z)
AdaDeDup: Adaptive Hybrid Data Pruning for Efficient Large-Scale Object Detection Training [33.01500681857408]
We introduce Adaptive De-Duplication (AdaDeDup), a novel framework that integrates density-based pruning with model-informed feedback in a cluster-adaptive manner.<n>It significantly outperforms prominent baselines, substantially reduces performance degradation, and achieves near-original model performance while pruning 20% of data.
arXiv Detail & Related papers (2025-06-24T22:35:51Z)
Geometric Median Matching for Robust k-Subset Selection from Noisy Data [75.86423267723728]
We propose a novel k-subset selection strategy that leverages Geometric Median -- a robust estimator with an optimal breakdown point of 1/2. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption.
arXiv Detail & Related papers (2025-04-01T09:22:05Z)
Data Pruning in Generative Diffusion Models [2.0111637969968]
Generative models aim to estimate the underlying distribution of the data, so presumably they should benefit from larger datasets. We show that eliminating redundant or noisy data in large datasets is beneficial particularly when done strategically.
arXiv Detail & Related papers (2024-11-19T14:13:25Z)
Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions. We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance. Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z)
Restoring balance: principled under/oversampling of data for optimal classification [0.0]
Class imbalance in real-world data poses a common bottleneck for machine learning tasks. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically. We provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered.
arXiv Detail & Related papers (2024-05-15T17:45:34Z)
PUMA: margin-based data pruning [51.12154122266251]
We focus on data pruning, where some training samples are removed based on the distance to the model classification boundary (i.e., margin) We propose PUMA, a new data pruning strategy that computes the margin using DeepFool. We show that PUMA can be used on top of the current state-of-the-art methodology in robustness, and it is able to significantly improve the model performance unlike the existing data pruning strategies.
arXiv Detail & Related papers (2024-05-10T08:02:20Z)
An Adaptive Cost-Sensitive Learning and Recursive Denoising Framework for Imbalanced SVM Classification [12.986535715303331]
Category imbalance is one of the most popular and important issues in the domain of classification. Emotion classification model trained on imbalanced datasets easily leads to unreliable prediction.
arXiv Detail & Related papers (2024-03-13T09:43:14Z)
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
Towards Accelerated Model Training via Bayesian Data Selection [45.62338106716745]
We propose a more reasonable data selection principle by examining the data's impact on the model's generalization loss. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models.
arXiv Detail & Related papers (2023-08-21T07:58:15Z)
Towards Better Certified Segmentation via Diffusion Models [62.21617614504225]
segmentation models can be vulnerable to adversarial perturbations, which hinders their use in critical-decision systems like healthcare or autonomous driving. Recently, randomized smoothing has been proposed to certify segmentation predictions by adding Gaussian noise to the input to obtain theoretical guarantees. In this paper, we address the problem of certifying segmentation prediction using a combination of randomized smoothing and diffusion models.
arXiv Detail & Related papers (2023-06-16T16:30:39Z)
Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data. We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
Adaptive Dimension Reduction and Variational Inference for Transductive Few-Shot Classification [2.922007656878633]
We propose a new clustering method based on Variational Bayesian inference, further improved by Adaptive Dimension Reduction. Our proposed method significantly improves accuracy in the realistic unbalanced transductive setting on various Few-Shot benchmarks.
arXiv Detail & Related papers (2022-09-18T10:29:02Z)
Self-Certifying Classification by Linearized Deep Assignment [65.0100925582087]
We propose a novel class of deep predictors for classifying metric data on graphs within PAC-Bayes risk certification paradigm. Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables learning posterior distributions on the hypothesis space.
arXiv Detail & Related papers (2022-01-26T19:59:14Z)
Last Layer Marginal Likelihood for Invariance Learning [12.00078928875924]
We introduce a new lower bound to the marginal likelihood, which allows us to perform inference for a larger class of likelihood functions. We work towards bringing this approach to neural networks by using an architecture with a Gaussian process in the last layer.
arXiv Detail & Related papers (2021-06-14T15:40:51Z)
Adversarial Robustness via Fisher-Rao Regularization [33.134075068748984]
Adrial robustness has become a topic of growing interest in machine learning. Fire is a new Fisher-Rao regularization for the categorical cross-entropy loss.
arXiv Detail & Related papers (2021-06-12T04:12:58Z)
Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.