Class-Proportional Coreset Selection for Difficulty-Separable Data
- URL: http://arxiv.org/abs/2507.10904v1
- Date: Tue, 15 Jul 2025 01:43:32 GMT
- Title: Class-Proportional Coreset Selection for Difficulty-Separable Data
- Authors: Elisa Tsai, Haizhong Zheng, Atul Prakash
- Abstract summary: We show that in domains such as network intrusion detection and medical imaging, data difficulty often clusters by class. We formalize this as class-difficulty separability and introduce the Class Difficulty Separability Coefficient. Our results underscore that explicitly modeling class-difficulty separability leads to more effective, robust, and generalizable data pruning.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: High-quality training data is essential for building reliable and efficient machine learning systems. One-shot coreset selection addresses this by pruning the dataset while maintaining or even improving model performance, often relying on training-dynamics-based data difficulty scores. However, most existing methods implicitly assume class-wise homogeneity in data difficulty, overlooking variation in data difficulty across different classes. In this work, we challenge this assumption by showing that, in domains such as network intrusion detection and medical imaging, data difficulty often clusters by class. We formalize this as class-difficulty separability and introduce the Class Difficulty Separability Coefficient (CDSC) as a quantitative measure. We demonstrate that high CDSC values correlate with performance degradation in class-agnostic coreset methods, which tend to overrepresent easy majority classes while neglecting rare but informative ones. To address this, we introduce class-proportional variants of multiple sampling strategies. Evaluated on five diverse datasets spanning security and medical domains, our methods consistently achieve state-of-the-art data efficiency. For instance, on CTU-13, at an extreme 99% pruning rate, a class-proportional variant of Coverage-centric Coreset Selection (CCS-CP) shows remarkable stability, with accuracy dropping only 2.58%, precision 0.49%, and recall 0.19%. In contrast, the class-agnostic CCS baseline, the next best method, suffers sharper declines of 7.59% in accuracy, 4.57% in precision, and 4.11% in recall. We further show that aggressive pruning enhances generalization in noisy, imbalanced, and large-scale datasets. Our results underscore that explicitly modeling class-difficulty separability leads to more effective, robust, and generalizable data pruning, particularly in high-stakes scenarios.
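The abstract does not reproduce the CDSC formula or the exact CCS-CP procedure, but the two central ideas can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the authors' implementation: `cdsc_proxy` uses a correlation-ratio-style measure (the between-class share of difficulty variance) as a stand-in for CDSC, and `class_proportional_coreset` allocates a per-class budget proportional to each class's share of the data, with evenly spaced selection over sorted difficulty as an assumed stratification heuristic.

```python
import numpy as np

def cdsc_proxy(difficulty, labels):
    """Correlation-ratio-style proxy for class-difficulty separability:
    the share of difficulty variance explained by class membership.
    (A stand-in; the paper's exact CDSC definition is not given here.)"""
    difficulty, labels = np.asarray(difficulty), np.asarray(labels)
    total_var = difficulty.var()
    if total_var == 0:
        return 0.0
    grand_mean = difficulty.mean()
    between = sum(
        (labels == c).mean() * (difficulty[labels == c].mean() - grand_mean) ** 2
        for c in np.unique(labels)
    )
    return float(between / total_var)

def class_proportional_coreset(difficulty, labels, keep_fraction):
    """Prune to `keep_fraction` of the data while preserving class
    proportions, so rare but informative classes are not squeezed out
    the way class-agnostic pruning tends to do."""
    difficulty, labels = np.asarray(difficulty), np.asarray(labels)
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Per-class budget proportional to the class's share of the data,
        # keeping at least one example per class.
        budget = max(1, int(round(keep_fraction * idx.size)))
        # Assumed stratification: sort by difficulty and take evenly
        # spaced examples so the coreset spans the class's difficulty range.
        order = idx[np.argsort(difficulty[idx])]
        picks = order[np.linspace(0, idx.size - 1, budget).astype(int)]
        selected.append(picks)
    return np.concatenate(selected)
```

At the paper's most extreme setting, a 99% pruning rate corresponds to `keep_fraction=0.01`.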
Related papers
- The Impact of Coreset Selection on Spurious Correlations and Group Robustness [29.00056007029943]
Coreset selection methods have shown promise in reducing the training data size while maintaining model performance for data-efficient machine learning.
We conduct the first comprehensive analysis of the implications of data selection on the spurious bias levels of the selected coresets and the robustness of downstream models trained on them.
arXiv Detail & Related papers (2025-07-15T19:46:30Z) - Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information [2.133855532092057]
We propose an effective data reduction strategy based on Pointwise V-Information (PVI).
Experiments show that classifier performance is maintained with only a 0.0001% to 0.76% drop in accuracy when 10%-30% of the data is removed.
We have adapted the PVI framework, which was previously limited to English datasets, to a variety of Chinese NLP tasks and base models.
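For reference, pointwise V-information follows the definition from Ethayarajh et al.'s V-usable-information framework, which this entry builds on: an example's PVI compares the model's confidence in the gold label with and without access to the input. A hedged sketch (the reduction paper's own thresholding details are not shown here):

```python
import numpy as np

def pointwise_v_information(p_given_input, p_given_null):
    """PVI(x -> y) = log2 g(y | x) - log2 g'(y | null), where g is a
    model trained with inputs and g' a model trained with the input
    withheld. Inputs here are the probabilities each model assigns to
    the gold label. Low or negative PVI flags hard or mislabeled
    examples; a PVI-based reduction strategy would drop the
    lowest-scoring instances first."""
    eps = 1e-12  # numerical guard against log(0)
    return (np.log2(np.asarray(p_given_input) + eps)
            - np.log2(np.asarray(p_given_null) + eps))
```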
arXiv Detail & Related papers (2025-06-19T06:59:19Z) - Non-Uniform Class-Wise Coreset Selection: Characterizing Category Difficulty for Data-Efficient Transfer Learning [19.152700266277247]
Non-Uniform Class-Wise Coreset Selection (NUCS) is a novel framework that integrates both class-level and instance-level criteria.
Our work highlights the importance of characterizing category difficulty in coreset selection, offering a robust and data-efficient solution for transfer learning.
arXiv Detail & Related papers (2025-04-17T15:40:51Z) - Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection [2.7554677967598047]
Adversarially robust learning is widely recognized to demand significantly more training examples.
Recent works propose the use of self-supervised adversarial training (SSAT) with external or synthetically generated unlabeled data to enhance model robustness.
We propose novel methods to strategically select a small subset of unlabeled data essential for SSAT and robustness improvement.
arXiv Detail & Related papers (2025-01-15T15:47:49Z) - SeMi: When Imbalanced Semi-Supervised Learning Meets Mining Hard Examples [54.760757107700755]
Semi-Supervised Learning (SSL) can leverage abundant unlabeled data to boost model performance.
The class-imbalanced data distribution in real-world scenarios poses great challenges to SSL, resulting in performance degradation.
We propose a method that enhances the performance of Imbalanced Semi-Supervised Learning by Mining Hard Examples (SeMi).
arXiv Detail & Related papers (2025-01-10T14:35:16Z) - SSL-CPCD: Self-supervised learning with composite pretext-class discrimination for improved generalisability in endoscopic image analysis [3.1542695050861544]
Deep learning-based supervised methods are widely popular in medical image analysis.
They require a large amount of training data and face issues in generalisability to unseen datasets.
We propose to explore patch-level instance-group discrimination and penalisation of inter-class variation using additive angular margin.
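The additive angular margin mentioned here is the ArcFace-style penalty on the target logit. A compact sketch of that logit computation (the scale `s` and margin `m` are illustrative hyperparameters, not values from the paper):

```python
import numpy as np

def additive_angular_margin_logits(features, weights, labels, s=30.0, m=0.5):
    """ArcFace-style additive angular margin: add margin m to the angle
    between each embedding and its gold-class prototype, then rescale.
    features : (b, d) embeddings; weights : (num_classes, d) prototypes;
    labels   : (b,) gold class indices."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(f @ w.T, -1.0, 1.0)  # cosine similarity per class
    theta = np.arccos(cos)
    # Penalize only the gold class by widening its angle.
    theta[np.arange(len(labels)), labels] += m
    return s * np.cos(theta)  # margin-adjusted logits for softmax loss
```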
arXiv Detail & Related papers (2023-05-31T21:28:08Z) - Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, generating hard pseudo-labels by a teacher model on unlabeled data as supervisory signals.
We analyze the challenges these methods face through empirical experiments.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
arXiv Detail & Related papers (2022-03-23T07:33:37Z) - Class-Aware Contrastive Semi-Supervised Learning [51.205844705156046]
We propose a general method named Class-aware Contrastive Semi-Supervised Learning (CCSSL) to improve pseudo-label quality and enhance the model's robustness in the real-world setting.
Our proposed CCSSL has significant performance improvements over the state-of-the-art SSL methods on the standard datasets CIFAR100 and STL10.
arXiv Detail & Related papers (2022-03-04T12:18:23Z) - CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
arXiv Detail & Related papers (2022-02-11T13:49:51Z) - Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
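A schematic reconstruction of that selection rule, inferred from the summary rather than the official OCS code: score each candidate by gradient alignment with the current minibatch (representativeness) and with past-task gradients (affinity), minus a redundancy penalty; the top-k scores would form the per-iteration coreset.

```python
import numpy as np

def ocs_style_scores(sample_grads, past_grad, tau=1.0):
    """Rank replay candidates by (i) similarity to the mean minibatch
    gradient (representativeness), (ii) affinity to a past-task gradient
    (anti-forgetting), and (iii) low redundancy with other candidates.
    A sketch of the idea only; `tau` and the exact combination are
    assumptions. sample_grads: (b, d) per-sample gradients;
    past_grad: (d,) reference gradient from replayed past-task data."""
    g = sample_grads / (np.linalg.norm(sample_grads, axis=1, keepdims=True) + 1e-12)
    mean_g = g.mean(axis=0)
    mean_g /= np.linalg.norm(mean_g) + 1e-12
    past = past_grad / (np.linalg.norm(past_grad) + 1e-12)
    similarity = g @ mean_g              # adaptation to the current task
    affinity = g @ past                  # alignment with past tasks
    redundancy = (g @ g.T).mean(axis=1)  # mean pairwise similarity (incl. self)
    return similarity + tau * affinity - redundancy
```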
arXiv Detail & Related papers (2021-06-02T11:39:25Z) - Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data [77.88594632644347]
Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks.
In realistic learning scenarios, the presence of heterogeneity across different clients' local datasets poses an optimization challenge.
We propose a novel momentum-based method to mitigate this decentralized training difficulty.
arXiv Detail & Related papers (2021-02-09T11:27:14Z)