Coreset selection based on Intra-class diversity
- URL: http://arxiv.org/abs/2509.21380v1
- Date: Tue, 23 Sep 2025 12:45:53 GMT
- Title: Coreset selection based on Intra-class diversity
- Authors: Imran Ashraf, Mukhtar Ullah, Muhammad Faisal Nadeem, Muhammad Nouman Noor,
- Abstract summary: Recent studies have adopted two different approaches to train Deep Learning models: training from scratch and transfer learning.<n>With the growing sizes of datasets, exploring solutions to this problem has recently gained the research community's attention.<n>We propose a method to extract intra-class diversity, forming per-class clusters that are utilized for the final sampling.
- Score: 4.638022690041187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep Learning models have transformed various domains, including the healthcare sector, particularly biomedical image classification by learning intricate features and enabling accurate diagnostics pertaining to complex diseases. Recent studies have adopted two different approaches to train DL models: training from scratch and transfer learning. Both approaches demand substantial computational time and resources due to the involvement of massive datasets in model training. These computational demands are further increased due to the design-space exploration required for selecting optimal hyperparameters, which typically necessitates several training rounds. With the growing sizes of datasets, exploring solutions to this problem has recently gained the research community's attention. A plausible solution is to select a subset of the dataset for training and hyperparameter search. This subset, referred to as the corset, must be a representative set of the original dataset. A straightforward approach to selecting the coreset could be employing random sampling, albeit at the cost of compromising the representativeness of the original dataset. A critical limitation of random sampling is the bias towards the dominant classes in an imbalanced dataset. Even if the dataset has inter-class balance, this random sampling will not capture intra-class diversity. This study addresses this issue by introducing an intelligent, lightweight mechanism for coreset selection. Specifically, it proposes a method to extract intra-class diversity, forming per-class clusters that are utilized for the final sampling. We demonstrate the efficacy of the proposed methodology by conducting extensive classification experiments on a well-known biomedical imaging dataset. Results demonstrate that the proposed scheme outperforms the random sampling approach on several performance metrics for uniform conditions.
Related papers
- Non-Uniform Class-Wise Coreset Selection: Characterizing Category Difficulty for Data-Efficient Transfer Learning [19.152700266277247]
Non-Uniform Class-Wise Coreset Selection (NUCS) is a novel framework that integrates both class-level and instance-level criteria.<n>Our work highlights the importance of characterizing category difficulty in coreset selection, offering a robust and data-efficient solution for transfer learning.
arXiv Detail & Related papers (2025-04-17T15:40:51Z) - Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm [50.492124556982674]
This paper introduces a novel choice-based sample selection framework.<n>It shifts the focus from evaluating individual sample quality to comparing the contribution value of different samples.<n>We validate our approach on a larger medical dataset, highlighting its practical applicability in real-world applications.
arXiv Detail & Related papers (2025-03-04T07:32:41Z) - Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning.<n>We construct pseudo-skill clusters by grouping gradient-based sample vectors.<n>We select the best-performing data selector for each skill cluster from a pool of selector experts.<n>This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z) - Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement [8.509688686402438]
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities.
This work addresses the question: How can we determine the optimal subset of data for effective training?
Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset.
arXiv Detail & Related papers (2024-09-17T17:25:31Z) - Dataset Quantization with Active Learning based Adaptive Sampling [11.157462442942775]
We show that maintaining performance is feasible even with uneven sample distributions.
We propose a novel active learning based adaptive sampling strategy to optimize the sample selection.
Our approach outperforms the state-of-the-art dataset compression methods.
arXiv Detail & Related papers (2024-07-09T23:09:18Z) - JSCDS: A Core Data Selection Method with Jason-Shannon Divergence for Caries RGB Images-Efficient Learning [2.508255511130695]
The performance of deep learning models depends on high-quality data and requires substantial training resources.
We propose a Core Data Selection Method with Jensen-Shannon Divergence (JSCDS) for efficient caries image learning and caries classification.
JSCDS outperforms other data selection methods in prediction performance and time consumption.
arXiv Detail & Related papers (2024-06-29T08:19:25Z) - Diversified Batch Selection for Training Acceleration [68.67164304377732]
A prevalent research line, known as online batch selection, explores selecting informative subsets during the training process.
vanilla reference-model-free methods involve independently scoring and selecting data in a sample-wise manner.
We propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples.
arXiv Detail & Related papers (2024-06-07T12:12:20Z) - Seeing Unseen: Discover Novel Biomedical Concepts via
Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z) - Foundation Model Makes Clustering A Better Initialization For Cold-Start Active Learning [5.609241010973952]
We propose to integrate foundation models with clustering methods to select samples for cold-start active learning.
Foundation models refer to those trained on massive datasets by the self-supervised paradigm.
For a comprehensive comparison, we included a classic ImageNet-supervised model to acquire embeddings.
arXiv Detail & Related papers (2024-02-04T16:27:37Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Reinforced Data Sampling for Model Diversification [15.547681142342846]
This paper proposes a new Reinforced Data Sampling (RDS) method to learn how to sample data adequately.
We formulate the optimisation problem of model diversification $delta-div$ in data sampling to maximise learning potentials and optimum allocation by injecting model diversity.
Our results suggest that the trainable sampling for model diversification is useful for competition organisers, researchers, or even starters to pursue full potentials of various machine learning tasks.
arXiv Detail & Related papers (2020-06-12T11:46:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.