Probabilistic Bilevel Coreset Selection
- URL: http://arxiv.org/abs/2301.09880v1
- Date: Tue, 24 Jan 2023 09:37:00 GMT
- Title: Probabilistic Bilevel Coreset Selection
- Authors: Xiao Zhou, Renjie Pi, Weizhong Zhang, Yong Lin, Tong Zhang
- Abstract summary: We propose a continuous probabilistic bilevel formulation of coreset selection by learning a probabilistic weight for each training sample.
We develop an efficient solver for the bilevel optimization problem via an unbiased policy gradient, without the trouble of implicit differentiation.
- Score: 24.874967723659022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of coreset selection in supervised learning is to produce a weighted subset of the data such that training only on the subset achieves performance similar to training on the entire dataset. Existing methods have achieved promising results in resource-constrained scenarios such as continual learning and streaming. However, most existing algorithms are limited to traditional machine learning models. The few algorithms that can handle large models adopt greedy search approaches, owing to the difficulty of the discrete subset selection problem; this search is computationally costly as the coreset grows and often produces suboptimal results. In this work, we propose, for the first time, a continuous probabilistic bilevel formulation of coreset selection that learns a probabilistic weight for each training sample. The overall objective is posed as a bilevel optimization problem in which 1) the inner loop samples coresets and trains the model to convergence and 2) the outer loop progressively updates the sample probabilities according to the model's performance. Importantly, we develop an efficient solver for the bilevel optimization problem via an unbiased policy gradient, without the trouble of implicit differentiation. We establish the convergence of the training procedure and demonstrate the superiority of our algorithm over various coreset selection methods across a range of tasks, especially in the more challenging label-noise and class-imbalance scenarios.
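To make the two loops concrete, here is a minimal sketch of the probabilistic formulation described above: each sample gets a Bernoulli inclusion probability, the inner loop trains a model on a sampled coreset, and the outer loop updates the probabilities with an unbiased REINFORCE-style policy gradient. The toy data, the logistic-regression learner, the hyperparameters, and the size penalty are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (a hypothetical stand-in for a real dataset).
n, d = 200, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + 0.5 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(Xs, ys, steps=200, lr=0.5):
    """Inner loop: fit a logistic-regression 'model' on the sampled coreset."""
    w = np.zeros(Xs.shape[1])
    for _ in range(steps):
        w -= lr * Xs.T @ (sigmoid(Xs @ w) - ys) / len(ys)
    return w

def full_loss(w):
    """Outer objective: log-loss of the trained model on the full dataset."""
    p = np.clip(sigmoid(X @ w), 1e-8, 1 - 1e-8)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

phi = np.zeros(n)          # per-sample logits; inclusion probability s_i = sigmoid(phi_i)
baseline, lr_out = 0.0, 2.0
for step in range(100):
    s = sigmoid(phi)
    m = (rng.random(n) < s).astype(float)        # sample a coreset mask
    if m.sum() < 2:
        continue
    w = train_logreg(X[m == 1], y[m == 1])       # inner loop: train to (near) convergence
    reward = -full_loss(w)                       # higher reward = better model
    baseline = 0.9 * baseline + 0.1 * reward     # variance-reduction baseline
    # REINFORCE update: grad_phi log p(m | s) = m - s for Bernoulli + sigmoid.
    phi += lr_out * (reward - baseline) * (m - s)
    phi -= 0.01 * s * (1 - s)                    # illustrative penalty on expected coreset size

coreset = np.argsort(-sigmoid(phi))[:50]         # keep the 50 most probable samples
```

The update exploits the fact that for m_i ~ Bernoulli(sigmoid(phi_i)) the score function reduces to m - s, so the outer loop never differentiates through the inner training run, which is the sense in which the abstract's policy gradient avoids implicit differentiation.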
Related papers
- Meta-Learning from Learning Curves for Budget-Limited Algorithm Selection [11.409496019407067]
In a budget-limited scenario, it is crucial to carefully select an algorithm candidate and allocate a budget for training it.
We propose a novel framework in which an agent must select the most promising algorithm during the learning process, without waiting until it is fully trained.
arXiv Detail & Related papers (2024-10-10T08:09:58Z)
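As a concrete point of comparison for the budget-limited setting above, here is a toy successive-halving loop, a classic baseline that observes partial learning curves and discards the weaker half each round. This is plainly not the paper's learned agent; the simulated exponential learning curves and budget numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical learning curves: score of 8 candidate algorithms vs. training budget.
n_algos = 8
plateau = rng.uniform(0.6, 0.9, size=n_algos)
rate = rng.uniform(0.05, 0.3, size=n_algos)
curve = lambda a, t: plateau[a] * (1 - np.exp(-rate[a] * t))

# Successive halving: give every survivor a slice of budget, keep the best half.
budget_per_round, alive = 4, list(range(n_algos))
spent = np.zeros(n_algos, dtype=int)
while len(alive) > 1:
    for a in alive:
        spent[a] += budget_per_round              # train each survivor a bit more
    scores = {a: curve(a, spent[a]) for a in alive}
    alive = sorted(alive, key=scores.get, reverse=True)[: max(1, len(alive) // 2)]

print("selected algorithm:", alive[0], "total budget used:", spent.sum())
```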
- TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data [29.45013725650798]
It is essential to extract a subset of an instruction dataset that achieves performance comparable to that of the full dataset.
We propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS)
Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection.
arXiv Detail & Related papers (2024-07-21T17:59:20Z)
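A rough sketch of the three TAGCOS steps summarized above (per-sample gradient features, clustering, greedy per-cluster selection), using a linear model's gradients as a stand-in for network gradients. The greedy matching criterion and the proportional budget split are simplifying assumptions, not the authors' exact algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Toy data and a warm-started linear model standing in for a trained network.
n, d = 300, 10
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(float)
w = rng.normal(size=d) * 0.1

# Step 1: per-sample gradients of the logistic loss as data representations.
p = 1.0 / (1.0 + np.exp(-X @ w))
G = (p - y)[:, None] * X                         # shape (n, d)

# Step 2: cluster similar gradients.
k_clusters, budget = 5, 30
labels = KMeans(n_clusters=k_clusters, n_init=10, random_state=0).fit_predict(G)

# Step 3: within each cluster, greedily pick samples whose running mean
# gradient tracks the cluster's mean gradient (a simplified greedy rule).
coreset = []
for c in range(k_clusters):
    idx = np.where(labels == c)[0]
    target = G[idx].mean(axis=0)
    chosen, acc = [], np.zeros(d)
    per_cluster = max(1, budget * len(idx) // n)
    for _ in range(min(per_cluster, len(idx))):
        best = min((i for i in idx if i not in chosen),
                   key=lambda i: np.linalg.norm((acc + G[i]) / (len(chosen) + 1) - target))
        chosen.append(best)
        acc += G[best]
    coreset.extend(chosen)
```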
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
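The sketch below is a greedy farthest-first heuristic that blends k-center coverage with per-point uncertainty, in the spirit of the weighted-sum objective above. It is not the paper's factor 3-approximation algorithm, and the blending weight lam and the simulated uncertainty scores are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy pool: feature vectors plus a per-point uncertainty score (e.g. the
# entropy of a model's predicted class probabilities; simulated here).
n, d, k = 200, 8, 12
X = rng.normal(size=(n, d))
uncertainty = rng.uniform(size=n)
lam = 0.5    # trade-off between coverage (k-center term) and uncertainty

first = int(np.argmax(uncertainty))
selected = [first]
dist = np.linalg.norm(X - X[first], axis=1)      # distance to nearest center
for _ in range(k - 1):
    score = lam * dist + (1 - lam) * uncertainty # weighted-sum selection score
    score[selected] = -np.inf
    nxt = int(np.argmax(score))
    selected.append(nxt)
    dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))

print("selected subset:", sorted(selected))
```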
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks a subset of the training data, referred to as a coreset, that maximizes the performance of models trained on it.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over a graph representation of the dataset for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at pruning rates of up to 70%.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
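A minimal sketch of message passing over a dataset graph for pruning, assuming a kNN graph over embeddings and precomputed difficulty scores: a forward pass aggregates neighbor difficulty into each score, and a reverse pass discounts a selected point's neighbors to keep the coreset diverse. The one-round update and the discount factor gamma are illustrative, not D2 Pruning's exact rules.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy setup: embeddings plus per-example difficulty scores (in the paper
# these come from a trained model; here they are simulated).
n, d, k_nn, budget = 150, 6, 5, 60
E = rng.normal(size=(n, d))
difficulty = rng.uniform(size=n)

# Build a kNN graph over the embeddings.
D = np.linalg.norm(E[:, None] - E[None, :], axis=2)
np.fill_diagonal(D, np.inf)
nbrs = np.argsort(D, axis=1)[:, :k_nn]

# Forward message passing: each score absorbs a discounted sum of its
# neighbors' difficulties (one round; gamma is an assumption).
gamma = 0.5
scores = difficulty + gamma * difficulty[nbrs].sum(axis=1)

# Reverse message passing: after picking a point, down-weight its
# neighbors' scores so the selection stays diverse.
selected, s = [], scores.copy()
for _ in range(budget):
    i = int(np.argmax(s))
    selected.append(i)
    s[i] = -np.inf
    s[nbrs[i]] -= gamma * difficulty[i]          # discourage near-duplicates
```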
- Adaptive Second Order Coresets for Data-efficient Machine Learning [5.362258158646462]
Training machine learning models on datasets incurs substantial computational costs.
We propose AdaCore to extract subsets of the training examples for efficient machine learning.
arXiv Detail & Related papers (2022-07-28T05:43:09Z)
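The AdaCore summary above is brief, so the following is one plausible reading rather than the authors' algorithm: per-sample gradients are preconditioned by a diagonal Hessian approximation (the "second order" ingredient), and a greedy facility-location step picks a weighted subset whose members represent the rest. The diagonal preconditioner and the weighting rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data and per-sample gradients of a logistic model.
n, d, budget = 200, 8, 20
X = rng.normal(size=(n, d))
y = (X[:, 1] > 0).astype(float)
w = rng.normal(size=d) * 0.1
p = 1.0 / (1.0 + np.exp(-X @ w))
G = (p - y)[:, None] * X

# Second-order preconditioning: divide by a diagonal Hessian approximation
# (for logistic loss, H ~ X^T diag(p(1-p)) X; we keep only its diagonal).
h = (p * (1 - p))[:, None] * X**2
Gp = G / (h.mean(axis=0) + 1e-8)

# Greedy facility location over preconditioned-gradient similarities:
# pick examples so every sample has a similar selected "representative".
sim = Gp @ Gp.T
best = np.full(n, sim.min())                     # finite floor keeps gains stable
selected = []
for _ in range(budget):
    gains = np.maximum(sim - best[:, None], 0.0).sum(axis=0)
    gains[selected] = -1.0                       # gains are >= 0, so -1 excludes
    j = int(np.argmax(gains))
    selected.append(j)
    best = np.maximum(best, sim[:, j])

# Weight each selected example by how many samples it represents.
weights = np.bincount(np.argmax(sim[:, selected], axis=1), minlength=budget)
```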
- Data Summarization via Bilevel Optimization [48.89977988203108]
A simple yet powerful approach is to operate on small subsets of data.
In this work, we propose a generic coreset framework that formulates the coreset selection as a cardinality-constrained bilevel optimization problem.
arXiv Detail & Related papers (2021-09-26T09:08:38Z)
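Written out, the cardinality-constrained bilevel problem sketched above takes the following generic form (the notation is ours, assumed rather than quoted from the paper):

$$\min_{S \subseteq \{1,\dots,n\},\ |S| \le k}\ \sum_{i=1}^{n} \ell\big(x_i;\, \theta^*(S)\big) \quad \text{s.t.} \quad \theta^*(S) \in \operatorname{arg\,min}_{\theta} \sum_{j \in S} \ell(x_j; \theta)$$

That is, the outer problem picks at most k points such that a model trained only on them (the inner problem) still performs well on the full dataset.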
- Low Budget Active Learning via Wasserstein Distance: An Integer Programming Approach [81.19737119343438]
Active learning is the process of training a model with limited labeled data by selecting a core subset of an unlabeled data pool to label.
We propose a new integer optimization problem for selecting a core set that minimizes the discrete Wasserstein distance from the unlabeled pool.
Our strategy requires high-quality latent features which we obtain by unsupervised learning on the unlabeled pool.
arXiv Detail & Related papers (2021-06-05T21:25:03Z)
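A small-scale sketch of the idea above, with two hedges: the latent features are simulated rather than learned, and the subset is grown greedily while each candidate subset is scored by a transport linear program, instead of solving the paper's single integer program.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)

# Toy unlabeled pool of latent features (the paper obtains high-quality
# features by unsupervised learning; here they are simulated).
n, d, k = 60, 4, 6
Z = rng.normal(size=(n, d))

def wasserstein_to_pool(subset):
    """Discrete W1 between uniform measures on the pool and on `subset`,
    solved as a transport LP."""
    m = len(subset)
    C = np.linalg.norm(Z[:, None] - Z[subset][None, :], axis=2)  # (n, m) costs
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0         # row marginals: mass 1/n each
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                  # column marginals: mass 1/m each
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Greedy core-set growth: add the point that most reduces the transport cost.
selected = []
for _ in range(k):
    cands = [i for i in range(n) if i not in selected]
    best = min(cands, key=lambda i: wasserstein_to_pool(selected + [i]))
    selected.append(best)

print("core set:", sorted(selected))
```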
- Outlier-Robust Learning of Ising Models Under Dobrushin's Condition [57.89518300699042]
We study the problem of learning Ising models satisfying Dobrushin's condition in the outlier-robust setting, where a constant fraction of the samples are adversarially corrupted.
Our main result is to provide the first computationally efficient robust learning algorithm for this problem with near-optimal error guarantees.
arXiv Detail & Related papers (2021-02-03T18:00:57Z)
- Uncovering Coresets for Classification With Multi-Objective Evolutionary Algorithms [0.8057006406834467]
A coreset is a subset of the training set with which a machine learning algorithm achieves performance similar to what it would deliver if trained on the whole original dataset.
A novel approach is presented: candidate coresets are iteratively optimized by adding and removing samples.
A multi-objective evolutionary algorithm is used to minimize simultaneously the number of points in the set and the classification error.
arXiv Detail & Related papers (2020-02-20T09:59:56Z)
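A bare-bones sketch of the evolutionary idea above: individuals are binary masks over the training pool, mutation adds and removes samples, and a Pareto filter simultaneously minimizes coreset size and a validation error. The 1-NN fitness, population size, and mutation scheme are assumptions; a full multi-objective EA such as NSGA-II would add crossover and crowding.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy classification data; fitness uses a 1-nearest-neighbor classifier.
n, d = 120, 5
X = rng.normal(size=(n, d))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
Xv, yv = X[:40], y[:40]          # held-out points for the error objective
Xt, yt = X[40:], y[40:]          # candidate pool the coreset is drawn from

def objectives(mask):
    """Two objectives to minimize: coreset size and 1-NN validation error."""
    if mask.sum() == 0:
        return (0, 1.0)
    Dv = np.linalg.norm(Xv[:, None] - Xt[mask][None, :], axis=2)
    pred = yt[mask][np.argmin(Dv, axis=1)]
    return (int(mask.sum()), float(np.mean(pred != yv)))

def dominated(a, b):             # does b dominate a?
    return all(bi <= ai for ai, bi in zip(a, b)) and b != a

# A simple evolutionary loop: mutate one parent per generation, then keep
# only the non-dominated masks (Pareto filtering, no crossover).
pop = [rng.random(len(yt)) < 0.3 for _ in range(20)]
for gen in range(50):
    child = pop[rng.integers(len(pop))].copy()
    flip = rng.integers(len(child), size=3)      # add/remove a few samples
    child[flip] = ~child[flip]
    pop.append(child)
    scores = [objectives(m) for m in pop]
    pop = [m for m, s in zip(pop, scores)
           if not any(dominated(s, t) for t in scores)]

print("Pareto front:", sorted({objectives(m) for m in pop}))
```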
- Stepwise Model Selection for Sequence Prediction via Deep Kernel Learning [100.83444258562263]
We propose a novel Bayesian optimization (BO) algorithm to tackle the challenge of model selection in this setting.
In order to solve the resulting multiple black-box function optimization problem jointly and efficiently, we exploit potential correlations among black-box functions.
We are the first to formulate the problem of stepwise model selection (SMS) for sequence prediction, and to design and demonstrate an efficient joint-learning algorithm for this purpose.
arXiv Detail & Related papers (2020-01-12T09:42:19Z)
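For orientation, here is generic Gaussian-process Bayesian optimization with a UCB acquisition over a single hyperparameter, the scaffolding the entry above builds on. The paper's actual contribution (deep kernel learning and joint treatment of correlated black-box functions) is not reproduced here, and the simulated objective and kernel length scale are assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)

# Black-box objective: validation score of a model as a function of one
# hyperparameter (simulated; each evaluation is assumed to be expensive).
f = lambda x: np.sin(3 * x) + 0.5 * x
grid = np.linspace(0.0, 3.0, 200)

def rbf(a, b, ls=0.4):
    """Squared-exponential kernel on scalars."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ls**2)

Xobs = list(rng.uniform(0, 3, size=2))           # two random initial evaluations
yobs = [f(x) for x in Xobs]
for _ in range(10):
    Xa = np.array(Xobs)
    K = rbf(Xa, Xa) + 1e-6 * np.eye(len(Xa))     # jitter for stability
    Ks = rbf(grid, Xa)
    mu = Ks @ np.linalg.solve(K, np.array(yobs))                 # GP posterior mean
    var = 1.0 - np.einsum("ij,ij->i", Ks @ np.linalg.inv(K), Ks) # posterior variance
    ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0)) # upper-confidence-bound acquisition
    x_next = float(grid[np.argmax(ucb)])
    Xobs.append(x_next)
    yobs.append(f(x_next))

print("best hyperparameter found:", Xobs[int(np.argmax(yobs))])
```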
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.