DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning
- URL: http://arxiv.org/abs/2204.08499v1
- Date: Mon, 18 Apr 2022 18:14:30 GMT
- Title: DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning
- Authors: Chengcheng Guo, Bo Zhao, Yanbing Bai
- Abstract summary: We provide an empirical study on popular coreset selection methods on CIFAR10 and ImageNet datasets.
Although some methods perform better in certain experiment settings, random selection is still a strong baseline.
- Score: 3.897574108827803
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Coreset selection, which aims to select a subset of the most informative
training samples, is a long-standing learning problem that can benefit many
downstream tasks such as data-efficient learning, continual learning, neural
architecture search, active learning, etc. However, many existing coreset
selection methods are not designed for deep learning, which may have high
complexity and poor generalization ability to unseen representations. In
addition, the recently proposed methods are evaluated on models, datasets, and
settings of different complexities. To advance the research of coreset
selection in deep learning, we contribute a comprehensive code library, namely
DeepCore, and provide an empirical study on popular coreset selection methods
on CIFAR10 and ImageNet datasets. Extensive experiment results show that,
although some methods perform better in certain experiment settings, random
selection is still a strong baseline.
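For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of random coreset selection, assuming only a list of integer class labels and a target fraction. The function name `random_coreset` and its parameters are hypothetical and are not taken from the DeepCore library; the class-balanced option is one common way to run the baseline, not necessarily the one used in the paper.

```python
import random
from collections import defaultdict


def random_coreset(labels, fraction, seed=0, class_balanced=True):
    """Return sorted indices of a randomly selected subset of the data.

    Hypothetical helper for illustration only (not DeepCore's API).
    labels: sequence of class labels, one per training sample.
    fraction: share of samples to keep, in (0, 1].
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # local RNG so results are reproducible
    n = len(labels)

    if not class_balanced:
        # Uniform sampling over the whole dataset.
        budget = max(1, round(fraction * n))
        return sorted(rng.sample(range(n), budget))

    # Group sample indices by class, then sample the same fraction per class.
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    selected = []
    for y in sorted(by_class):
        members = by_class[y]
        k = max(1, round(fraction * len(members)))
        selected.extend(rng.sample(members, k))
    return sorted(selected)


# Usage: keep 10% of a toy dataset with 10 equally sized classes.
labels = [i % 10 for i in range(1000)]
subset = random_coreset(labels, fraction=0.1)
print(len(subset))
```

Any other selection method can be compared against this baseline by training on the returned indices at the same fraction.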
Related papers
- TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data [29.45013725650798]
It is essential to extract a subset of instruction datasets that achieves comparable performance to the full dataset.
We propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS).
Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection.
arXiv Detail & Related papers (2024-07-21T17:59:20Z)
- D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning [70.98091101459421]
Coreset selection seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as coreset.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
arXiv Detail & Related papers (2023-10-11T23:01:29Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- Probabilistic Bilevel Coreset Selection [24.874967723659022]
We propose a continuous probabilistic bilevel formulation of coreset selection by learning a probabilistic weight for each training sample.
We develop an efficient solver for the bilevel optimization problem via unbiased policy gradient, without the trouble of implicit differentiation.
arXiv Detail & Related papers (2023-01-24T09:37:00Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
- What Makes Good Contrastive Learning on Small-Scale Wearable-based Tasks? [59.51457877578138]
We study contrastive learning on the wearable-based activity recognition task.
This paper presents an open-source PyTorch library, CL-HAR, which can serve as a practical tool for researchers.
arXiv Detail & Related papers (2022-02-12T06:10:15Z)
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting high-affinity samples to past tasks, which directly inhibits catastrophic forgetting.
arXiv Detail & Related papers (2021-06-02T11:39:25Z)
- Extending Contrastive Learning to Unsupervised Coreset Selection [26.966136750754732]
We propose an unsupervised method for selecting a coreset without using any labels.
We use two leading methods for contrastive learning.
Compared with existing coreset selection methods with labels, our approach reduced the cost associated with human annotation.
arXiv Detail & Related papers (2021-03-05T10:21:51Z)
- Confident Coreset for Active Learning in Medical Image Analysis [57.436224561482966]
We propose a novel active learning method, confident coreset, which considers both uncertainty and distribution for effectively selecting informative samples.
By comparative experiments on two medical image analysis tasks, we show that our method outperforms other active learning methods.
arXiv Detail & Related papers (2020-04-05T13:46:16Z)
- Learning to Select Base Classes for Few-shot Classification [96.92372639495551]
We use the Similarity Ratio as an indicator for the generalization performance of a few-shot model.
We then formulate the base class selection problem as a submodular optimization problem over Similarity Ratio.
arXiv Detail & Related papers (2020-04-01T09:55:18Z)
- Uncovering Coresets for Classification With Multi-Objective Evolutionary Algorithms [0.8057006406834467]
A coreset is a subset of the training set, using which a machine learning algorithm obtains performances similar to what it would deliver if trained over the whole original data.
A novel approach is presented: candidate coresets are iteratively optimized, adding and removing samples.
A multi-objective evolutionary algorithm is used to minimize simultaneously the number of points in the set and the classification error.
arXiv Detail & Related papers (2020-02-20T09:59:56Z)