A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances
- URL: http://arxiv.org/abs/2505.17799v1
- Date: Fri, 23 May 2025 12:18:34 GMT
- Title: A Coreset Selection of Coreset Selection Literature: Introduction and Recent Advances
- Authors: Brian B. Moser, Arundhati S. Shanbhag, Stanislav Frolov, Federico Raue, Joachim Folz, Andreas Dengel
- Abstract summary: Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. This survey presents a more comprehensive view by unifying three major lines of coreset research into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets.
- Score: 8.319613769928331
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coreset selection targets the challenge of finding a small, representative subset of a large dataset that preserves essential patterns for effective machine learning. Although several surveys have examined data reduction strategies before, most focus narrowly on either classical geometry-based methods or active learning techniques. In contrast, this survey presents a more comprehensive view by unifying three major lines of coreset research, namely, training-free, training-oriented, and label-free approaches, into a single taxonomy. We present subfields often overlooked by existing work, including submodular formulations, bilevel optimization, and recent progress in pseudo-labeling for unlabeled datasets. Additionally, we examine how pruning strategies influence generalization and neural scaling laws, offering new insights that are absent from prior reviews. Finally, we compare these methods under varying computational, robustness, and performance demands and highlight open challenges, such as robustness, outlier filtering, and adapting coreset selection to foundation models, for future research.
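To make the submodular line of work concrete, here is a minimal sketch of greedy facility-location coreset selection, one of the classic submodular formulations the survey covers. The Euclidean similarity and function names are illustrative assumptions, not a prescription from the paper.

```python
import numpy as np

def facility_location_greedy(features: np.ndarray, budget: int) -> list[int]:
    """Greedily maximize the facility-location function
    F(S) = sum_i max_{j in S} sim(i, j), a classic monotone submodular
    objective for coreset selection."""
    # Illustrative similarity: negated squared Euclidean distance,
    # shifted so all entries are non-negative. O(n^2) memory, sketch only.
    diff = features[:, None, :] - features[None, :, :]
    d2 = (diff ** 2).sum(-1)
    sim = d2.max() - d2
    n = features.shape[0]
    selected: list[int] = []
    best_sim = np.zeros(n)  # max similarity of each point to the current coreset
    for _ in range(budget):
        # Marginal gain of candidate j: sum_i max(0, sim[i, j] - best_sim[i]).
        gains = np.maximum(sim - best_sim[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        best_sim = np.maximum(best_sim, sim[:, j])
    return selected
```

Greedy selection retains the standard (1 - 1/e) guarantee for monotone submodular objectives; at scale, one would replace the dense similarity matrix with lazy-greedy or stochastic evaluation.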
Related papers
- Non-Uniform Class-Wise Coreset Selection: Characterizing Category Difficulty for Data-Efficient Transfer Learning [19.152700266277247]
Non-Uniform Class-Wise Coreset Selection (NUCS) is a novel framework that integrates both class-level and instance-level criteria. Our work highlights the importance of characterizing category difficulty in coreset selection, offering a robust and data-efficient solution for transfer learning.
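The abstract does not give NUCS's allocation rule, so the sketch below is only a hypothetical illustration of the core idea: split a global coreset budget across classes in proportion to a per-class difficulty score rather than uniformly.

```python
import numpy as np

def nonuniform_class_budgets(difficulty: np.ndarray, class_sizes: np.ndarray,
                             total_budget: int) -> np.ndarray:
    """Hypothetical allocation: split a global coreset budget across classes
    in proportion to a per-class difficulty score. NUCS's actual rule is
    more involved; this only illustrates the non-uniform idea."""
    weights = difficulty / difficulty.sum()
    budgets = np.minimum(np.floor(weights * total_budget).astype(int), class_sizes)
    # Hand any leftover budget to the hardest classes that still have headroom.
    for c in np.argsort(-difficulty):
        if budgets.sum() >= total_budget:
            break
        budgets[c] += min(class_sizes[c] - budgets[c], total_budget - budgets.sum())
    return budgets
```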
arXiv Detail & Related papers (2025-04-17T15:40:51Z)
- Learning from Neighbors: Category Extrapolation for Long-Tail Learning [62.30734737735273]
We offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. We introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes. To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss.
arXiv Detail & Related papers (2024-10-21T13:06:21Z)
- Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models [36.22392593103493]
Data selection for fine-tuning large language models (LLMs) aims to choose a high-quality subset from existing datasets. Existing surveys, however, lack an in-depth exploration of the fine-tuning phase. We introduce a novel three-stage scheme, comprising feature extraction, criteria design, and selector evaluation, to systematically categorize and evaluate these methods.
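A hedged illustration of the three-stage scheme follows; the abstract names the stages but not their interfaces, so all types and names below are assumptions.

```python
from typing import Callable, Sequence
import numpy as np

# Assumed interfaces for the three stages named in the abstract.
FeatureFn = Callable[[Sequence[str]], np.ndarray]  # stage 1: texts -> embeddings
ScoreFn = Callable[[np.ndarray], np.ndarray]       # stage 2: embeddings -> scores

def select_finetuning_subset(texts: Sequence[str], extract: FeatureFn,
                             score: ScoreFn, k: int) -> list[int]:
    feats = extract(texts)    # stage 1: feature extraction
    quality = score(feats)    # stage 2: criteria design (quality scoring)
    # Stage 3 (selector evaluation) would compare selectors; here we
    # simply keep the top-k indices under the chosen criterion.
    return np.argsort(-quality)[:k].tolist()
```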
arXiv Detail & Related papers (2024-06-20T08:58:58Z)
- Foundation Model Makes Clustering A Better Initialization For Cold-Start Active Learning [5.609241010973952]
We propose to integrate foundation models with clustering methods to select samples for cold-start active learning.
Foundation models are models trained on massive datasets, typically under a self-supervised paradigm.
For a comprehensive comparison, we included a classic ImageNet-supervised model to acquire embeddings.
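One common recipe matching this description (the paper's exact variant may differ): embed the unlabeled pool with a foundation model, cluster the embeddings, and label the sample nearest each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

def cold_start_selection(embeddings: np.ndarray, budget: int) -> list[int]:
    """Cluster the unlabeled pool's embeddings and pick the sample nearest
    each centroid as the initial set to label."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(embeddings)
    picks = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks
```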
arXiv Detail & Related papers (2024-02-04T16:27:37Z)
- Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints [69.27190330994635]
Coreset selection is powerful in reducing computational costs and accelerating data processing for deep learning algorithms.
We propose a method that maintains a strict priority order between model performance and coreset size throughout optimization.
Extensive experiments confirm its superiority, often yielding better model performance with smaller coresets.
arXiv Detail & Related papers (2023-11-15T03:43:04Z)
- DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection [72.25697820290502]
This work introduces a straightforward and efficient strategy to identify potential novel classes through zero-shot classification.
We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training.
Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance.
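A plausible sketch of the zero-shot identification step (the scoring and threshold are illustrative, not DST-Det's exact design): classify region-proposal embeddings against novel-class text embeddings, CLIP-style, and keep confident matches as pseudo-labels.

```python
import numpy as np

def pseudo_label_novel_boxes(region_embs: np.ndarray, text_embs: np.ndarray,
                             threshold: float = 0.3) -> list[tuple[int, int]]:
    """Zero-shot classify region proposals against novel-class text embeddings
    and keep confident matches as (box, class) pseudo-labels."""
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = r @ t.T                       # cosine similarity, (boxes, classes)
    best = scores.argmax(axis=1)
    keep = scores.max(axis=1) >= threshold
    return [(int(i), int(best[i])) for i in np.where(keep)[0]]
```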
arXiv Detail & Related papers (2023-10-02T17:52:24Z)
- Large-scale Pre-trained Models are Surprisingly Strong in Incremental Novel Class Discovery [76.63807209414789]
We challenge the status quo in class-iNCD and propose a learning paradigm where class discovery occurs continuously and in a truly unsupervised manner.
We propose baselines, composed of a frozen PTM backbone and a learnable linear classifier, that are not only simple to implement but also resilient under longer learning scenarios.
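A minimal PyTorch rendering of that baseline, assuming the backbone maps inputs to feature vectors of known dimension:

```python
import torch
import torch.nn as nn

class FrozenPTMClassifier(nn.Module):
    """A frozen pre-trained backbone plus a learnable linear head."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False        # only the head is trained
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():              # no gradients through the PTM
            feats = self.backbone(x)
        return self.head(feats)
```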
arXiv Detail & Related papers (2023-03-28T13:47:16Z)
- A Comprehensive Survey on Deep Clustering: Taxonomy, Challenges, and Future Directions [48.97008907275482]
Clustering is a fundamental machine learning task which has been widely studied in the literature.
Deep Clustering, i.e., jointly optimizing representation learning and clustering, has been proposed and has since attracted growing attention in the community.
We summarize the essential components of deep clustering and categorize existing methods by the ways they design interactions between deep representation learning and clustering.
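One canonical interaction between representation learning and clustering is the DEC-style self-training loss; the sketch below (notation and defaults ours) soft-assigns embeddings to centroids and sharpens the assignments against a KL target.

```python
import torch

def dec_loss(z: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """DEC-style self-training: soft-assign embeddings to centroids with a
    Student's t kernel, then sharpen toward a target distribution via KL."""
    q = 1.0 / (1.0 + torch.cdist(z, centroids) ** 2)
    q = q / q.sum(dim=1, keepdim=True)              # soft assignments
    p = q ** 2 / q.sum(dim=0)                       # emphasize confident points
    p = (p / p.sum(dim=1, keepdim=True)).detach()   # fixed targets per step
    return (p * (p.log() - q.log())).sum(dim=1).mean()   # KL(P || Q)
```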
arXiv Detail & Related papers (2022-06-15T15:05:13Z)
- DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning [3.897574108827803]
We provide an empirical study of popular coreset selection methods on the CIFAR10 and ImageNet datasets.
Although some methods perform better in certain experiment settings, random selection is still a strong baseline.
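For reference, the random baseline is trivial to reproduce:

```python
import numpy as np

def random_coreset(n: int, fraction: float, seed: int = 0) -> np.ndarray:
    """Uniform-random baseline: sample floor(fraction * n) indices
    without replacement."""
    rng = np.random.default_rng(seed)
    return rng.choice(n, size=int(fraction * n), replace=False)
```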
arXiv Detail & Related papers (2022-04-18T18:14:30Z)
- A Simple Yet Effective Pretraining Strategy for Graph Few-shot Learning [38.66690010054665]
We propose a simple transductive fine-tuning based framework as a new paradigm for graph few-shot learning.
For pretraining, we propose a supervised contrastive learning framework with data augmentation strategies specific to few-shot node classification.
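A generic supervised contrastive loss of the kind this pretraining builds on (the paper adds graph-specific augmentations, omitted here):

```python
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Generic supervised contrastive loss: pull embeddings with the same
    label together, push the rest apart."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float('-inf'))       # drop self-similarity
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~eye
    # Average log-probability over each anchor's positives.
    denom = pos.sum(dim=1).clamp(min=1)
    return -(log_prob.masked_fill(~pos, 0.0).sum(dim=1) / denom).mean()
```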
arXiv Detail & Related papers (2022-03-29T22:30:00Z)
- Data Summarization via Bilevel Optimization [48.89977988203108]
A simple yet powerful approach is to operate on small subsets of data.
In this work, we propose a generic coreset framework that formulates coreset selection as a cardinality-constrained bilevel optimization problem.
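Written out (notation ours), the cardinality-constrained bilevel problem is:

$$\min_{S \subseteq D,\ |S| \le k} \mathcal{L}_{\mathrm{val}}\big(\theta^{*}(S)\big) \quad \text{s.t.} \quad \theta^{*}(S) = \operatorname*{arg\,min}_{\theta} \mathcal{L}_{\mathrm{train}}(\theta, S),$$

where the outer problem picks the coreset $S$ under budget $k$ and the inner problem trains model parameters $\theta$ on $S$ alone.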
arXiv Detail & Related papers (2021-09-26T09:08:38Z)
- Online Coreset Selection for Rehearsal-based Continual Learning [65.85595842458882]
In continual learning, we store a subset of training examples (a coreset) to be replayed later to alleviate catastrophic forgetting.
We propose Online Coreset Selection (OCS), a simple yet effective method that selects the most representative and informative coreset at each iteration.
Our proposed method maximizes the model's adaptation to a target dataset while selecting samples with high affinity to past tasks, which directly inhibits catastrophic forgetting.
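A rough sketch of gradient-based scoring in the spirit of OCS, assuming precomputed per-sample gradients; the paper also uses a diversity term and its own weighting, both omitted here.

```python
import torch
import torch.nn.functional as F

def ocs_scores(sample_grads: torch.Tensor, past_grads: torch.Tensor) -> torch.Tensor:
    """Score candidates by similarity to the current minibatch's mean gradient
    (representativeness) plus similarity to past-task gradients (affinity)."""
    g = F.normalize(sample_grads, dim=1)
    batch_mean = F.normalize(sample_grads.mean(0, keepdim=True), dim=1)
    past_mean = F.normalize(past_grads.mean(0, keepdim=True), dim=1)
    representativeness = (g @ batch_mean.T).squeeze(1)
    affinity = (g @ past_mean.T).squeeze(1)
    return representativeness + affinity   # take the top-k as the coreset
```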
arXiv Detail & Related papers (2021-06-02T11:39:25Z)