Model-agnostic Coreset Selection via LLM-based Concept Bottlenecks
- URL: http://arxiv.org/abs/2502.16733v1
- Date: Sun, 23 Feb 2025 22:14:42 GMT
- Title: Model-agnostic Coreset Selection via LLM-based Concept Bottlenecks
- Authors: Akshay Mehra, Trisha Mittal, Subhadra Gopalakrishnan, Joshua Kimball,
- Abstract summary: Coreset Selection (CS) identifies a subset of training data that achieves model performance comparable to using the entire dataset.<n>These scores are inefficient to compute and hard to interpret as they do not indicate whether a sample is difficult to learn in general or only for a specific model.<n>Our work proposes an interpretable score that gauges a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model.
- Score: 6.857632954159568
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coreset Selection (CS) identifies a subset of training data that achieves model performance comparable to using the entire dataset. Many state-of-the-art CS methods, select coresets using scores whose computation requires training the downstream model on the entire dataset and recording changes in its behavior on samples as it trains (training dynamics). These scores are inefficient to compute and hard to interpret as they do not indicate whether a sample is difficult to learn in general or only for a specific model. Our work addresses these challenges by proposing an interpretable score that gauges a sample's difficulty using human-understandable textual attributes (concepts) independent of any downstream model. Specifically, we measure the alignment between a sample's visual features and concept bottlenecks, derived via large language models, by training a linear concept bottleneck layer and compute the sample's difficulty score using it. We then use this score and a stratified sampling strategy to identify the coreset. Crucially, our score is efficiently computable without training the downstream model on the full dataset even once, leads to high-performing coresets for various downstream models, and is computable even for an unlabeled dataset. Through experiments on CIFAR-10, CIFAR-100, and ImageNet-1K, we show our coresets outperform random subsets, even at high pruning rates, and achieve model performance comparable to or better than coresets found by training dynamics-based methods.
Related papers
- In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models [37.45103473809928]
We propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model.
By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data.
arXiv Detail & Related papers (2024-08-07T05:48:05Z) - TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data [29.45013725650798]
It is essential to extract a subset of instruction datasets that achieves comparable performance to the full dataset.
We propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS)
Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection.
arXiv Detail & Related papers (2024-07-21T17:59:20Z) - No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance [68.18779562801762]
multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance.
Our study reveals an exponential need for training data which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
arXiv Detail & Related papers (2024-04-04T17:58:02Z) - D2 Pruning: Message Passing for Balancing Diversity and Difficulty in
Data Pruning [70.98091101459421]
Coreset selection seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as coreset.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods for up to 70% pruning rates.
arXiv Detail & Related papers (2023-10-11T23:01:29Z) - Optimal Sample Selection Through Uncertainty Estimation and Its
Application in Deep Learning [22.410220040736235]
We present a theoretically optimal solution for addressing both coreset selection and active learning.
Our proposed method, COPS, is designed to minimize the expected loss of a model trained on subsampled data.
arXiv Detail & Related papers (2023-09-05T14:06:33Z) - Probabilistic Bilevel Coreset Selection [24.874967723659022]
We propose a continuous probabilistic bilevel formulation of coreset selection by learning a probablistic weight for each training sample.
We develop an efficient solver to the bilevel optimization problem via unbiased policy gradient without trouble of implicit differentiation.
arXiv Detail & Related papers (2023-01-24T09:37:00Z) - Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space.
We evaluate our method on datasets evaluating both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) noise.
arXiv Detail & Related papers (2022-02-04T15:46:27Z) - MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the generalization and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z) - Active Learning for Deep Visual Tracking [51.5063680734122]
Convolutional neural networks (CNNs) have been successfully applied to the single target tracking task in recent years.
In this paper, we propose an active learning method for deep visual tracking, which selects and annotates the unlabeled samples to train the deep CNNs model.
Under the guidance of active learning, the tracker based on the trained deep CNNs model can achieve competitive tracking performance while reducing the labeling cost.
arXiv Detail & Related papers (2021-10-17T11:47:56Z) - No Fear of Heterogeneity: Classifier Calibration for Federated Learning
with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated ssian mixture model.
Experimental results demonstrate that CCVR state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z) - Balancing Constraints and Submodularity in Data Subset Selection [43.03720397062461]
We show that one can achieve similar accuracy to traditional deep-learning models, while using less training data.
We propose a novel diversity driven objective function, and balancing constraints on class labels and decision boundaries using matroids.
arXiv Detail & Related papers (2021-04-26T19:22:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.