Coverage-centric Coreset Selection for High Pruning Rates
- URL: http://arxiv.org/abs/2210.15809v1
- Date: Fri, 28 Oct 2022 00:14:00 GMT
- Title: Coverage-centric Coreset Selection for High Pruning Rates
- Authors: Haizhong Zheng, Rui Liu, Fan Lai, Atul Prakash
- Abstract summary: One-shot coreset selection aims to select a subset of the training data, given a pruning rate, that can achieve high accuracy for models that are subsequently trained only with that subset.
State-of-the-art coreset selection methods typically assign an importance score to each example and select the most important examples to form a coreset.
But at high pruning rates, they have been found to suffer a catastrophic accuracy drop, performing worse than even random coreset selection.
- Score: 11.18635356469467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One-shot coreset selection aims to select a subset of the training data,
given a pruning rate, that can achieve high accuracy for models that are
subsequently trained only with that subset. State-of-the-art coreset selection
methods typically assign an importance score to each example and select the
most important examples to form a coreset. These methods perform well at low
pruning rates; but at high pruning rates, they have been found to suffer a
catastrophic accuracy drop, performing worse than even random coreset
selection. In this paper, we explore the reasons for this accuracy drop both
theoretically and empirically. We extend previous theoretical results on the
bound for model loss in terms of coverage provided by the coreset. Inspired by
theoretical results, we propose a novel coverage-based metric and, based on the
metric, find that coresets selected by importance-based coreset methods at high
pruning rates can be expected to perform poorly compared to random coresets
because of worse data coverage. We then propose a new coreset selection method,
Coverage-centric Coreset Selection (CCS), where we jointly consider overall
data coverage based on the proposed metric as well as importance of each
example. We evaluate CCS on four datasets and show that it achieves
significantly better accuracy than state-of-the-art coreset selection methods
as well as random sampling under high pruning rates, and comparable performance
at low pruning rates. For example, CCS achieves 7.04% better accuracy than
random sampling and at least 20.16% better than popular importance-based
selection methods on CIFAR10 with a 90% pruning rate.
Related papers
- Speculative Coreset Selection for Task-Specific Fine-tuning [35.15159197063161]
Task-specific fine-tuning is essential for the deployment of large language models (LLMs).
In this paper, we introduce STAFF, a speculative coreset selection method.
We show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates.
arXiv Detail & Related papers (2024-10-02T07:42:25Z)
- Optimal Kernel Choice for Score Function-based Causal Discovery [92.65034439889872]
We propose a kernel selection method within the generalized score function that automatically selects the kernel that best fits the data.
We conduct experiments on both synthetic data and real-world benchmarks, and the results demonstrate that our proposed method outperforms existing kernel selection methods.
arXiv Detail & Related papers (2024-07-14T09:32:20Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions; a rough sketch of this kind of weighted objective appears after this list.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints [69.27190330994635]
Coreset selection is powerful in reducing computational costs and accelerating data processing for deep learning algorithms.
We propose an innovative method that maintains an optimization priority order over model performance and coreset size.
Empirically, extensive experiments confirm its superiority, often yielding better model performance with smaller coreset sizes.
arXiv Detail & Related papers (2023-11-15T03:43:04Z)
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks to select a subset of the training data, referred to as a coreset, so as to maximize the performance of models trained on this subset.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at pruning rates of up to 70%.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z)
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256]
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
arXiv Detail & Related papers (2021-04-11T09:50:24Z)
- Extending Contrastive Learning to Unsupervised Coreset Selection [26.966136750754732]
We propose an unsupervised way of selecting a coreset from entirely unlabeled data.
We use two leading methods for contrastive learning.
Compared with existing coreset selection methods that require labels, our approach reduces the cost associated with human annotation.
arXiv Detail & Related papers (2021-03-05T10:21:51Z)
- Data-Independent Structured Pruning of Neural Networks via Coresets [21.436706159840018]
We propose the first efficient structured pruning algorithm with a provable trade-off between its compression rate and the approximation error for any future test sample.
Unlike previous works, our coreset is data-independent, meaning that it provably guarantees the accuracy of the function for any input $x \in \mathbb{R}^d$, including an adversarial one.
arXiv Detail & Related papers (2020-08-19T08:03:09Z)
- Bayesian Coresets: Revisiting the Nonconvex Optimization Perspective [30.963638533636352]
We propose and analyze a novel algorithm for coreset selection.
We provide explicit convergence rate guarantees and present an empirical evaluation on a variety of benchmark datasets.
arXiv Detail & Related papers (2020-07-01T19:34:59Z)
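For the weighted k-center entry above, the sketch below illustrates, in generic terms, how a k-center-style coverage term (distance to the nearest already-selected example) can be combined with an uncertainty term through a weighted sum. The greedy rule, the normalization, and the `alpha` weight are illustrative assumptions; this is not the paper's factor 3-approximation algorithm.

```python
import numpy as np

def greedy_kcenter_uncertainty(features, uncertainty, budget, alpha=0.5, seed=0):
    """Greedy subset selection trading off a k-center coverage term
    (distance to the nearest selected point) against per-example
    uncertainty, via a weighted sum controlled by `alpha`.

    features:    (n, d) array of example embeddings.
    uncertainty: (n,) array of per-example uncertainty scores in [0, 1].
    """
    rng = np.random.default_rng(seed)
    n = len(features)
    selected = [int(rng.integers(n))]  # seed with a random example
    # Distance from every example to its nearest selected example.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)

    while len(selected) < budget:
        # Normalize the coverage term so it is comparable to uncertainty.
        coverage = min_dist / (min_dist.max() + 1e-12)
        gain = alpha * coverage + (1 - alpha) * uncertainty
        gain[selected] = -np.inf  # never pick the same example twice
        nxt = int(np.argmax(gain))
        selected.append(nxt)
        # Update nearest-selected distances with the newly added example.
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(features - features[nxt], axis=1))
    return np.array(selected)
```

With `alpha=1` this reduces to plain greedy k-center selection, and with `alpha=0` to pure uncertainty sampling; intermediate values trade coverage against informativeness.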