Not All Data Matters: An End-to-End Adaptive Dataset Pruning Framework
for Enhancing Model Performance and Efficiency
- URL: http://arxiv.org/abs/2312.05599v1
- Date: Sat, 9 Dec 2023 16:01:21 GMT
- Title: Not All Data Matters: An End-to-End Adaptive Dataset Pruning Framework
for Enhancing Model Performance and Efficiency
- Authors: Suorong Yang, Hongchao Yang, Suhan Guo, Furao Shen, Jian Zhao
- Abstract summary: We propose an end-to-end Adaptive DAtaset PRUNing framework called AdaPruner.
AdaPruner iteratively prunes redundant samples to an expected pruning ratio.
It can still significantly enhance model performance even after pruning up to 10-30% of the training data.
- Score: 9.460023981858319
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While deep neural networks have demonstrated remarkable performance across
various tasks, they typically require massive training data. Due to the
presence of redundancies and biases in real-world datasets, not all data in the
training dataset contributes to the model performance. To address this issue,
dataset pruning techniques have been introduced to enhance model performance
and efficiency by eliminating redundant training samples and reducing
computational and memory overhead. However, most previous works rely on
manually crafted scalar scores, limiting their practical performance and
scalability across diverse deep networks and datasets. In this paper, we
propose AdaPruner, an end-to-end Adaptive DAtaset PRUNing framEwoRk. AdaPruner
can perform effective dataset pruning without the need for explicitly defined
metrics. Our framework jointly prunes training data and fine-tunes models with
task-specific optimization objectives. AdaPruner leverages (1) An adaptive
dataset pruning (ADP) module, which iteratively prunes redundant samples to an
expected pruning ratio; and (2) A pruning performance controller (PPC) module,
which optimizes the model performance for accurate pruning. Therefore,
AdaPruner exhibits high scalability and compatibility across various datasets
and deep networks, yielding improved dataset distribution and enhanced model
performance. AdaPruner can still significantly enhance model performance even
after pruning up to 10-30% of the training data. Notably, these improvements
are accompanied by substantial savings in memory and computation costs.
Qualitative and quantitative experiments suggest that AdaPruner outperforms
other state-of-the-art dataset pruning methods by a large margin.
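To make the two-module loop described above concrete, the following is a minimal sketch of an iterative prune-then-finetune loop in PyTorch. It is an illustration under assumptions of ours, not the AdaPruner implementation: the per-sample-loss scoring rule, the geometric pruning schedule, and every function name below are placeholders, whereas the paper's ADP and PPC modules learn the pruning decisions jointly with the task-specific objective.

```python
# Minimal sketch of an iterative prune-then-finetune loop in PyTorch.
# NOTE: this is NOT the authors' AdaPruner code. The scoring rule (per-sample
# loss) and the geometric schedule are illustrative stand-ins; the paper's
# ADP/PPC modules learn the pruning decisions end-to-end instead.
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset


def score_samples(model, dataset, device="cpu"):
    """Assumption: rank samples by current per-sample loss (lower = more redundant)."""
    model.eval()
    loss_fn = nn.CrossEntropyLoss(reduction="none")
    scores = []
    loader = DataLoader(dataset, batch_size=256, shuffle=False)
    with torch.no_grad():
        for x, y in loader:
            logits = model(x.to(device))
            scores.append(loss_fn(logits, y.to(device)).cpu())
    return torch.cat(scores)


def iterative_prune(model, optimizer, dataset, target_keep=0.8,
                    rounds=5, epochs_per_round=1, device="cpu"):
    """Alternate between pruning a slice of the data and fine-tuning on the rest."""
    keep_idx = torch.arange(len(dataset))
    # Geometric schedule so the keep ratio reaches roughly `target_keep` after all rounds.
    per_round_keep = target_keep ** (1.0 / rounds)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(rounds):
        # Pruning step (ADP-like): drop the lowest-scoring, most redundant-looking samples.
        subset = Subset(dataset, keep_idx.tolist())
        scores = score_samples(model, subset, device)
        n_keep = int(per_round_keep * len(keep_idx))
        keep_idx = keep_idx[torch.argsort(scores, descending=True)[:n_keep]]
        # Performance step (PPC-like): fine-tune on the retained subset to preserve accuracy.
        loader = DataLoader(Subset(dataset, keep_idx.tolist()),
                            batch_size=128, shuffle=True)
        model.train()
        for _ in range(epochs_per_round):
            for x, y in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(x.to(device)), y.to(device))
                loss.backward()
                optimizer.step()
    return keep_idx  # indices of the retained training samples
```

In this sketch, `target_keep=0.8` corresponds to pruning 20% of the training data, within the 10-30% range quoted above; the returned indices define the retained subset used for subsequent training.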
Related papers
- ImageNet-RIB Benchmark: Large Pre-Training Datasets Don't Guarantee Robustness after Fine-Tuning [30.422932548359952]
We introduce a new robust fine-tuning benchmark, ImageNet-RIB (Robustness Inheritance Benchmark).
The benchmark consists of related but distinct specialized (downstream) tasks.
We find that the continual learning methods EWC and LwF maintain robustness after fine-tuning.
arXiv Detail & Related papers (2024-10-28T22:33:22Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization [30.738229850748137]
MolPeg is a Molecular data Pruning framework for enhanced Generalization.
It focuses on the source-free data pruning scenario, where data pruning is applied with pretrained models.
It consistently outperforms existing DP methods across four downstream tasks.
arXiv Detail & Related papers (2024-09-02T09:06:04Z)
- Exploring Learning Complexity for Efficient Downstream Dataset Pruning [8.990878450631596]
Existing dataset pruning methods require training on the entire dataset.
We propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC).
Our method is motivated by the observation that easy samples that are learned faster can also be learned with fewer parameters.
arXiv Detail & Related papers (2024-02-08T02:29:33Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application (see the sketch after this list).
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Effective pruning of web-scale datasets based on complexity of concept clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with far fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Building Resilience to Out-of-Distribution Visual Data via Input
Optimization and Model Finetuning [13.804184845195296]
We propose a preprocessing model that learns to optimise input data for a specific target vision model.
We investigate several out-of-distribution scenarios in the context of semantic segmentation for autonomous vehicles.
We demonstrate that our approach can enable performance on such data comparable to that of a finetuned model.
arXiv Detail & Related papers (2022-11-29T14:06:35Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch achieves final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation [101.26235068460551]
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks.
Models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains.
We introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner.
arXiv Detail & Related papers (2020-10-24T08:36:49Z)
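The LESS entry above frames instruction-data selection as a similarity search over low-rank gradient features. As a rough illustration, and emphatically not the authors' implementation, the sketch below projects per-batch gradients with a dense random projection and ranks candidates by cosine similarity to the averaged projected gradient of a small target set; `proj_dim`, the helper names, and the dense projection itself are assumptions of ours, and a dense projection matrix is only feasible for toy models.

```python
# Illustrative sketch of gradient-similarity data selection, loosely inspired by
# the LESS entry above; NOT the authors' code. Per-batch gradients are compressed
# with a fixed random projection and ranked by cosine similarity to an averaged
# gradient computed on a small target set.
import torch
import torch.nn.functional as F


def projected_grad(loss, params, proj):
    """Flatten the gradient of `loss` w.r.t. `params` and project it to low rank."""
    grads = torch.autograd.grad(loss, params)
    flat = torch.cat([g.reshape(-1) for g in grads])
    return proj @ flat  # shape: (proj_dim,)


def select_top_k(model, loss_fn, candidates, targets, k, proj_dim=1024, seed=0):
    """Rank candidate batches by alignment of their projected gradients with
    the averaged projected gradient of a small target set."""
    params = [p for p in model.parameters() if p.requires_grad]
    n_params = sum(p.numel() for p in params)
    # Dense random projection: workable for small models only; larger models
    # would need a memory-efficient projection instead.
    gen = torch.Generator().manual_seed(seed)
    proj = torch.randn(proj_dim, n_params, generator=gen) / proj_dim ** 0.5

    # Direction that (locally) improves the target set.
    target_vec = torch.zeros(proj_dim)
    for x, y in targets:
        target_vec += projected_grad(loss_fn(model(x), y), params, proj)
    target_vec /= max(len(targets), 1)

    # Score each candidate batch by cosine similarity to the target direction.
    scores = []
    for x, y in candidates:
        g = projected_grad(loss_fn(model(x), y), params, proj)
        scores.append(F.cosine_similarity(g, target_vec, dim=0))
    order = torch.argsort(torch.stack(scores), descending=True)
    return order[:k]  # indices of the k most target-aligned candidates
```

Here `candidates` and `targets` are assumed to be lists of `(inputs, labels)` batches; the function returns the indices of the `k` candidates whose gradients best align with the target direction.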