SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection
- URL: http://arxiv.org/abs/2509.21748v1
- Date: Fri, 26 Sep 2025 01:26:45 GMT
- Title: SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection
- Authors: Brian B. Moser, Tobias C. Nauen, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Joachim Folz, Andreas Dengel
- Abstract summary: SubZeroCore is a training-free coreset selection method that integrates submodular coverage and density into a single, unified objective. We show that SubZeroCore matches training-based baselines and significantly outperforms them at high pruning rates, while dramatically reducing computational overhead.
- Score: 9.129619927191973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of coreset selection is to identify representative subsets of datasets for efficient model training. Yet, existing approaches paradoxically require expensive training-based signals, e.g., gradients, decision boundary estimates or forgetting counts, computed over the entire dataset prior to pruning, which undermines their very purpose by requiring training on samples they aim to avoid. We introduce SubZeroCore, a novel, training-free coreset selection method that integrates submodular coverage and density into a single, unified objective. To achieve this, we introduce a sampling strategy based on a closed-form solution to optimally balance these objectives, guided by a single hyperparameter that explicitly controls the desired coverage for local density measures. Despite no training, extensive evaluations show that SubZeroCore matches training-based baselines and significantly outperforms them at high pruning rates, while dramatically reducing computational overhead. SubZeroCore also demonstrates superior robustness to label noise, highlighting its practical effectiveness and scalability for real-world scenarios.
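The abstract does not state the objective or the closed-form sampling rule, so the following is only a rough sketch of the general idea it names: greedily maximizing a submodular coverage function whose terms are weighted by a local density score, with no model training involved. The Gaussian similarity, the k-nearest-neighbor density estimate, and the function names `pairwise_similarity`, `knn_density`, and `greedy_coreset` are all assumptions made for illustration and are not taken from the paper.

```python
import math


def pairwise_similarity(points):
    """Similarity in (0, 1]: exp(-squared Euclidean distance). Assumed kernel."""
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sim[i][j] = math.exp(-d2)
    return sim


def knn_density(sim, k):
    """Hypothetical density score: mean similarity to the k most similar other points."""
    density = []
    for i, row in enumerate(sim):
        others = sorted((s for j, s in enumerate(row) if j != i), reverse=True)
        density.append(sum(others[:k]) / k)
    return density


def greedy_coreset(points, budget, k=2):
    """Greedily maximize f(S) = sum_i w_i * max_{j in S} sim(i, j).

    This is a density-weighted facility-location function, which is monotone
    submodular for non-negative weights and similarities. It stands in for the
    paper's objective, which the abstract does not spell out.
    """
    sim = pairwise_similarity(points)
    weight = knn_density(sim, k)
    n = len(points)
    best_cover = [0.0] * n  # best similarity of each point to the selected set
    selected = []
    for _ in range(budget):
        best_gain, best_idx = -1.0, None
        for c in range(n):
            if c in selected:
                continue
            # Marginal gain of adding candidate c to the current selection.
            gain = sum(
                weight[i] * max(sim[i][c] - best_cover[i], 0.0) for i in range(n)
            )
            if gain > best_gain:
                best_gain, best_idx = gain, c
        selected.append(best_idx)
        best_cover = [max(best_cover[i], sim[i][best_idx]) for i in range(n)]
    return selected


if __name__ == "__main__":
    # Two well-separated toy clusters of three points each.
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    print(greedy_coreset(data, budget=2))
```

On this toy input, a budget of two yields one point from each cluster, because a second point from an already covered cluster adds almost no marginal coverage. In practice the similarities would presumably be computed on feature embeddings rather than raw coordinates, but that choice is also an assumption here.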
Related papers
- UNSEEN: Enhancing Dataset Pruning from a Generalization Perspective [17.593940249922557]
We propose a plug-and-play framework, UNSEEN, which can be integrated into existing dataset pruning methods. We scale UNSEEN to multi-step scenarios and propose an incremental selection technique through scoring models trained on varying coresets. Our method significantly outperforms existing state-of-the-art (SOTA) methods on CIFAR-10, CIFAR-100, and ImageNet-1K.
arXiv Detail & Related papers (2025-11-17T05:17:39Z) - The Easy Path to Robustness: Coreset Selection using Sample Hardness [12.378609890122945]
We propose a framework linking a sample's adversarial vulnerability to its hardness, which we quantify using the average input gradient norm (AIGN) over training. We present EasyCore, a coreset selection algorithm that retains only the samples with low AIGN for training. We empirically show that models trained on EasyCore-selected data achieve significantly higher adversarial accuracy than those trained with competing coreset methods.
arXiv Detail & Related papers (2025-10-13T05:28:16Z) - Non-Uniform Class-Wise Coreset Selection: Characterizing Category Difficulty for Data-Efficient Transfer Learning [19.152700266277247]
Non-Uniform Class-Wise Coreset Selection (NUCS) is a novel framework that integrates both class-level and instance-level criteria. Our work highlights the importance of characterizing category difficulty in coreset selection, offering a robust and data-efficient solution for transfer learning.
arXiv Detail & Related papers (2025-04-17T15:40:51Z) - Coreset Selection via LLM-based Concept Bottlenecks [6.857632954159568]
Coreset Selection (CS) aims to identify a subset of the training dataset that achieves model performance comparable to using the entire dataset. Our work proposes a score that computes a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model. We show that our coresets outperform random subsets, even at high pruning rates, and achieve model performance comparable to or better than coresets found by training dynamics-based methods.
arXiv Detail & Related papers (2025-02-23T22:14:42Z) - Zero-Shot Coreset Selection: Efficient Pruning for Unlabeled Data [22.45812577928658]
Coreset selection aims to find a representative subset of data to train models.
ZCore is a method that efficiently selects coresets without ground truth labels or training on candidate data.
We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods.
arXiv Detail & Related papers (2024-11-22T21:17:49Z) - Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding [55.107555305760954]
We propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion.
Our method achieves competitive accuracy performance, even with absolute superiority of zero exemplar buffer and 1.02x the base model.
arXiv Detail & Related papers (2024-01-17T09:01:29Z) - Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints [69.27190330994635]
Coreset selection is powerful in reducing computational costs and accelerating data processing for deep learning algorithms.
We propose an innovative method, which maintains optimization priority order over the model performance and coreset size.
Empirically, extensive experiments confirm its superiority, often yielding better model performance with smaller coreset sizes.
arXiv Detail & Related papers (2023-11-15T03:43:04Z) - MILO: Model-Agnostic Subset Selection Framework for Efficient Model Training and Tuning [68.12870241637636]
We propose MILO, a model-agnostic subset selection framework that decouples the subset selection from model training.
Our empirical results indicate that MILO can train models $3\times$ to $10\times$ faster and tune hyperparameters $20\times$ to $75\times$ faster than full-dataset training or tuning without compromising performance.
arXiv Detail & Related papers (2023-01-30T20:59:30Z) - Adaptive Second Order Coresets for Data-efficient Machine Learning [5.362258158646462]
Training machine learning models on datasets incurs substantial computational costs.
We propose AdaCore to extract subsets of the training examples for efficient machine learning.
arXiv Detail & Related papers (2022-07-28T05:43:09Z) - Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z) - Weakly Supervised Deep Nuclei Segmentation Using Partial Points Annotation in Histopathology Images [51.893494939675314]
We propose a novel weakly supervised segmentation framework based on partial points annotation.
We show that our method can achieve competitive performance compared to the fully supervised counterpart and the state-of-the-art methods.
arXiv Detail & Related papers (2020-07-10T15:41:29Z) - Ensemble Wrapper Subsampling for Deep Modulation Classification [70.91089216571035]
Subsampling of received wireless signals is important for relaxing hardware requirements as well as the computational cost of signal processing algorithms.
We propose a subsampling technique to facilitate the use of deep learning for automatic modulation classification in wireless communication systems.
arXiv Detail & Related papers (2020-05-10T06:11:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.