Exploring Learning Complexity for Efficient Downstream Dataset Pruning
- URL: http://arxiv.org/abs/2402.05356v2
- Date: Tue, 08 Oct 2024 13:56:33 GMT
- Title: Exploring Learning Complexity for Efficient Downstream Dataset Pruning
- Authors: Wenyu Jiang, Zhenlong Liu, Zejian Xie, Songxin Zhang, Bingyi Jing, Hongxin Wei,
- Abstract summary: Existing dataset pruning methods require training on the entire dataset.
We propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC)
Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters.
- Score: 8.990878450631596
- License:
- Abstract: The ever-increasing fine-tuning cost of large-scale pre-trained models gives rise to the importance of dataset pruning, which aims to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC), to identify informative images and instructions from the downstream dataset efficiently. Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters. Specifically, we define the Learning Complexity to quantify sample hardness and utilize a lightweight weights masking process for fast estimation, instead of the costly SGD optimization. Based on DLC, we further design a flexible under-sampling with randomness (dubbed FlexRand), replacing the top-K strategy, to alleviate the severe subset distribution shift. Extensive experiments with downstream image and instructions dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. In the images pruning benchmark, DLC significantly reduces the pruning time by 35x while establishing state-of-the-art performance with FlexRand.
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art Imagehttps://info.arxiv.org/help/prep#commentsNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z) - KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks.
We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process.
Our method can reduce total training time by up to 22% impacting accuracy only by 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - RPLKG: Robust Prompt Learning with Knowledge Graph [11.893917358053004]
We propose a new method, robust prompt learning with knowledge graph (RPLKG)
Based on the knowledge graph, we automatically design diverse interpretable and meaningful prompt sets.
RPLKG shows a significant performance improvement compared to zero-shot learning.
arXiv Detail & Related papers (2023-04-21T08:22:58Z) - On Measuring the Intrinsic Few-Shot Hardness of Datasets [49.37562545777455]
We show that few-shot hardness may be intrinsic to datasets, for a given pre-trained model.
We propose a simple and lightweight metric called "Spread" that captures the intuition that few-shot learning is made possible.
Our metric better accounts for few-shot hardness compared to existing notions of hardness, and is 8-100x faster to compute.
arXiv Detail & Related papers (2022-11-16T18:53:52Z) - Dataset Condensation with Distribution Matching [30.571335208276246]
dataset condensation aims to replace the original large training set with a significantly smaller learned synthetic set.
We propose a simple yet effective dataset condensation technique that requires significantly lower training cost.
Thanks to its efficiency, we apply our method to more realistic and larger datasets with sophisticated neural architectures.
arXiv Detail & Related papers (2021-10-08T15:02:30Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z) - Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.