Geometric Median Matching for Robust k-Subset Selection from Noisy Data
- URL: http://arxiv.org/abs/2504.00564v2
- Date: Thu, 03 Apr 2025 11:12:07 GMT
- Title: Geometric Median Matching for Robust k-Subset Selection from Noisy Data
- Authors: Anish Acharya, Sujay Sanghavi, Alexandros G. Dimakis, Inderjit S Dhillon
- Abstract summary: We propose a novel k-subset selection strategy that leverages the Geometric Median (GM) -- a robust estimator with an optimal breakdown point of 1/2. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption.
- Score: 75.86423267723728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data pruning -- the combinatorial task of selecting a small and representative subset from a large dataset -- is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. However, existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, we propose Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median -- a robust estimator with an optimal breakdown point of 1/2 -- to enhance resilience against noisy data. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved O(1/k) convergence rate -- a quadratic improvement over random sampling -- even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings and at high pruning rates, making it a strong baseline for robust data pruning.
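The selection rule described in the abstract can be made concrete with a short sketch. The following is a minimal NumPy illustration, not the authors' implementation: `geometric_median` approximates the GM with Weiszfeld's algorithm, and `gm_matching` greedily picks points so the running subset mean tracks the GM. The function names, the herding-style update, and the iteration budget are all illustrative assumptions.

```python
import numpy as np

def geometric_median(X, n_iters=100, eps=1e-8):
    """Approximate the geometric median of the rows of X (Weiszfeld's algorithm)."""
    mu = X.mean(axis=0)                      # start from the (non-robust) mean
    for _ in range(n_iters):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), eps)  # guard divide-by-zero
        w = 1.0 / d                          # inverse-distance weights
        mu_next = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_next - mu) < eps:
            break
        mu = mu_next
    return mu

def gm_matching(X, k):
    """Greedily pick k points whose mean approximates the geometric median of X."""
    gm = geometric_median(X)
    available = np.ones(len(X), dtype=bool)
    selected, theta = [], gm.copy()          # herding-style running target
    for _ in range(k):
        dist = np.linalg.norm(X - theta, axis=1)
        dist[~available] = np.inf            # no repeats
        i = int(dist.argmin())               # point that best moves the mean toward the GM
        selected.append(i)
        available[i] = False
        theta = theta + gm - X[i]            # herding update: accumulate residual
    return np.array(selected)
```

On a toy dataset where a constant fraction of rows are replaced by outliers, the mean of `X[gm_matching(X, k)]` stays near the GM, whereas the empirical mean of a random k-subset can drift arbitrarily.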
Related papers
- Geometric Median (GM) Matching for Robust Data Pruning [29.458270105150564]
We propose a herding-style greedy algorithm that yields a $k$-subset approximating the geometric median of a noisy dataset. Experiments across several popular deep learning benchmarks indicate that GM Matching consistently improves over the prior state-of-the-art.
arXiv Detail & Related papers (2024-06-25T00:02:01Z)
- Multigroup Robustness [5.659543670443081]
We study multigroup robust algorithms whose robustness guarantees for each subpopulation only degrade with the amount of data corruption inside that subpopulation.
Our techniques establish a new connection between multigroup fairness and robustness.
arXiv Detail & Related papers (2024-05-01T16:35:04Z)
- DRoP: Distributionally Robust Data Pruning [11.930434318557156]
We conduct the first systematic study of the impact of data pruning on the classification bias of trained models. We propose DRoP, a distributionally robust approach to pruning, and empirically demonstrate its performance on standard computer vision benchmarks.
arXiv Detail & Related papers (2024-04-08T14:55:35Z)
- CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models [1.6339731044538859]
This paper addresses the challenges of aligning large language models with human values via preference learning.
We propose a novel method for robustly handling incomplete and maliciously manipulated preference datasets in the AI pipeline to enhance LLMs' resilience.
arXiv Detail & Related papers (2024-03-05T07:58:12Z)
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique with a lower upper bound on time complexity, reducing the memory complexity from $O(n^2)$ to $O(kn)$ with $k \ll n$.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
- Adversarially robust clustering with optimality guarantees [7.0830450321883935]
We consider the problem of clustering data points coming from sub-Gaussian mixtures.
Existing methods that provably achieve the optimal mislabeling error, such as the Lloyd algorithm, are usually vulnerable to outliers.
We propose a simple robust algorithm based on the coordinatewise median that obtains the optimal mislabeling rate even when adversarial outliers are present (a minimal sketch appears after this list).
arXiv Detail & Related papers (2023-06-16T17:17:07Z)
- Revisiting Rotation Averaging: Uncertainties and Robust Losses [51.64986160468128]
We argue that the main problem of current methods is the minimized cost function, which is only weakly connected with the input data via the estimated epipolar geometries.
We propose to better model the underlying noise distributions by directly propagating the uncertainty from the point correspondences into the rotation averaging.
arXiv Detail & Related papers (2023-03-09T11:51:20Z)
- Characterizing the Optimal 0-1 Loss for Multi-class Classification with a Test-time Attacker [57.49330031751386]
We find achievable information-theoretic lower bounds on loss in the presence of a test-time attacker for multi-class classifiers on any discrete dataset.
We provide a general framework for finding the optimal 0-1 loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints.
arXiv Detail & Related papers (2023-02-21T15:17:13Z)
- Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL)
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance-sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class (a minimal sketch appears after this list).
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
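To make the "Adversarially robust clustering" entry above concrete: the idea is a Lloyd-style alternation whose center update uses the coordinatewise median instead of the mean. A minimal sketch under that reading; the random initialization, iteration count, and names are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def robust_lloyd(X, k, n_iters=50, seed=0):
    """Lloyd-style clustering with coordinatewise-median center updates (sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # assignment step: nearest center in Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: the coordinatewise median resists outliers that
        # would drag a mean-based center arbitrarily far
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = np.median(members, axis=0)
    return labels, centers
```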
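Similarly, the "Risk Minimization from Adaptively Collected Data" entry centers on importance-sampling weighted ERM. A hedged sketch of the weighting step only; the propensity interface and names are assumptions, and the paper's estimator and guarantees are more involved.

```python
import numpy as np

def is_weighted_risk(losses, logging_propensities, target_probs=None):
    """Importance-sampling weighted empirical risk for adaptively collected data.

    losses               : per-sample loss of the hypothesis being evaluated
    logging_propensities : probability the adaptive logger assigned to each
                           observed sample/action (must be > 0)
    target_probs         : probabilities under the distribution of interest;
                           None means uniform reweighting
    """
    if target_probs is None:
        target_probs = np.ones_like(logging_propensities)
    weights = target_probs / logging_propensities  # importance weights
    return float(np.mean(weights * losses))
```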