D2 Pruning: Message Passing for Balancing Diversity and Difficulty in
Data Pruning
- URL: http://arxiv.org/abs/2310.07931v1
- Date: Wed, 11 Oct 2023 23:01:29 GMT
- Title: D2 Pruning: Message Passing for Balancing Diversity and Difficulty in
Data Pruning
- Authors: Adyasha Maharana, Prateek Yadav, Mohit Bansal
- Abstract summary: Coreset selection seeks to select a subset of the training data so as to maximize the performance of models trained on this subset, also referred to as a coreset.
We propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection.
Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at pruning rates of up to 70%.
- Score: 70.98091101459421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Analytical theories suggest that higher-quality data can lead to lower test
errors in models trained on a fixed data budget. Moreover, a model can be
trained on a lower compute budget without compromising performance if a dataset
can be stripped of its redundancies. Coreset selection (or data pruning) seeks
to select a subset of the training data so as to maximize the performance of
models trained on this subset, also referred to as a coreset. There are two
dominant approaches: (1) geometry-based data selection for maximizing data
diversity in the coreset, and (2) functions that assign difficulty scores to
samples based on training dynamics. Optimizing for data diversity leads to a
coreset that is biased towards easier samples, whereas selection by difficulty
ranking omits easy samples that are necessary for the training of deep learning
models. This demonstrates that data diversity and importance scores are two
complementary factors that need to be jointly considered during coreset
selection. We represent a dataset as an undirected graph and propose a novel
pruning algorithm, D2 Pruning, that uses forward and reverse message passing
over this dataset graph for coreset selection. D2 Pruning updates the
difficulty scores of each example by incorporating the difficulty of its
neighboring examples in the dataset graph. Then, these updated difficulty
scores direct a graph-based sampling method to select a coreset that
encapsulates both diverse and difficult regions of the dataset space. We
evaluate supervised and self-supervised versions of our method on various
vision and language datasets. Results show that D2 Pruning improves coreset
selection over previous state-of-the-art methods at pruning rates of up to 70%.
Additionally, we find that using D2 Pruning for filtering large multimodal
datasets leads to increased diversity in the dataset and improved
generalization of pretrained models.
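
To make the pipeline described in the abstract concrete, here is a minimal sketch, assuming a kNN graph built from example embeddings, an additive forward update that mixes each example's difficulty with its neighbors', and a greedy sampler that discounts the neighbors of already-selected points as the reverse pass. The kernel, the update rule, and the hyperparameters below are illustrative assumptions, not the authors' exact formulation.

```python
# Hypothetical sketch of the D2 Pruning idea: build a kNN graph over example
# embeddings, diffuse difficulty scores along the graph (forward message passing),
# then greedily sample a coreset while down-weighting the neighbors of already
# selected points (reverse message passing). All names and constants are assumed.
import numpy as np

def knn_graph(embeddings, k=5):
    """Return each example's k nearest neighbors and RBF-style edge weights."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nbrs = np.argsort(dists, axis=1)[:, :k]
    weights = np.exp(-np.take_along_axis(dists, nbrs, axis=1))  # assumed kernel
    return nbrs, weights

def d2_prune(embeddings, difficulty, budget, k=5, gamma=1.0):
    """Select `budget` example indices by diffusing difficulty over a kNN graph."""
    nbrs, w = knn_graph(embeddings, k)
    # Forward pass: each example's score absorbs its neighbors' difficulty.
    scores = difficulty + gamma * (w * difficulty[nbrs]).sum(axis=1)
    selected = []
    for _ in range(budget):
        i = int(np.argmax(scores))
        selected.append(i)
        scores[i] = -np.inf  # never pick the same example twice
        # Reverse pass: penalize the selected example's neighbors so the coreset
        # also covers diverse regions of the dataset graph.
        scores[nbrs[i]] -= gamma * w[i] * difficulty[i]
    return np.array(selected)

# Toy usage: 100 random embeddings with random difficulty scores, keeping 30.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))
diff = rng.random(100)
coreset_idx = d2_prune(emb, diff, budget=30)
```

In practice the toy embeddings and random scores would be replaced by learned representations and training-dynamics-based difficulty scores, as described in the abstract.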
Related papers
- 3DS: Decomposed Difficulty Data Selection's Case Study on LLM Medical Domain Adaptation [13.058299222554295]
Large Language Models excel in general tasks but struggle in specialized domains like healthcare.
We propose a two-stage model-centric data selection framework, Decomposed Difficulty Data Selection (3DS).
Our experiments on real-world healthcare datasets demonstrate that 3DS outperforms existing methods in accuracy by over 5.29%.
arXiv Detail & Related papers (2024-10-13T02:29:00Z)
- TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data [29.45013725650798]
It is essential to extract a subset of instruction datasets that achieves comparable performance to the full dataset.
We propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS)
Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection (see the sketch after this list).
arXiv Detail & Related papers (2024-07-21T17:59:20Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that provides the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
arXiv Detail & Related papers (2023-09-29T17:37:29Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- Probabilistic Bilevel Coreset Selection [24.874967723659022]
We propose a continuous probabilistic bilevel formulation of coreset selection by learning a probabilistic weight for each training sample.
We develop an efficient solver for the bilevel optimization problem via unbiased policy gradient, without the trouble of implicit differentiation.
arXiv Detail & Related papers (2023-01-24T09:37:00Z)
- Auto-weighted Multi-view Feature Selection with Graph Optimization [90.26124046530319]
We propose a novel unsupervised multi-view feature selection model based on graph learning.
The contributions are threefold: (1) during the feature selection procedure, the consensus similarity graph shared by different views is learned.
Experiments on various datasets demonstrate the superiority of the proposed method compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-04-11T03:25:25Z)
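
The TAGCOS entry above describes a pipeline of gradient-based representations, clustering, and greedy coreset selection. The sketch below illustrates that general recipe with k-means over per-example gradient features and a cluster-proportional greedy pick; the feature choice, quota rule, and function names are assumptions for illustration, not the paper's exact algorithm.

```python
# Hypothetical sketch of gradient-clustered coreset selection in the spirit of
# the TAGCOS summary: cluster per-example gradient features, then greedily take
# the members closest to each centroid, with a per-cluster quota proportional to
# cluster size. Feature extraction and the greedy rule are simplified assumptions.
import numpy as np
from sklearn.cluster import KMeans

def select_coreset(grad_features, budget, n_clusters=10, seed=0):
    """Greedy, cluster-balanced selection over gradient-feature space."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(grad_features)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) == 0:
            continue
        # Per-cluster budget proportional to cluster size (at least one example).
        quota = max(1, int(round(budget * len(members) / len(grad_features))))
        # Greedy: take the members closest to the cluster centroid first.
        dists = np.linalg.norm(grad_features[members] - km.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:quota]].tolist())
    return np.array(selected[:budget])

# Toy usage with random "gradient" features for 1,000 examples, keeping 100.
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))
coreset_idx = select_coreset(feats, budget=100)
```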