Dataset Pruning: Reducing Training Data by Examining Generalization Influence
- URL: http://arxiv.org/abs/2205.09329v1
- Date: Thu, 19 May 2022 05:36:35 GMT
- Title: Dataset Pruning: Reducing Training Data by Examining Generalization Influence
- Authors: Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, Ping Li
- Abstract summary: Do all training data contribute to the model's performance? How can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance?
- Score: 30.30255670341501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The great success of deep learning heavily relies on increasingly large training datasets, which come at the price of huge computational and infrastructural costs. This raises crucial questions: do all training data contribute to the model's performance? How much does each individual training sample, or a sub-training-set, affect the model's generalization, and how can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these questions, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% decrease in test accuracy, which is superior to previous score-based sample selection methods.
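A minimal sketch of the influence idea is given below. It is not the authors' method: instead of the paper's constrained optimization with its theoretical guarantee, it scores each training example with a first-order influence proxy (alignment between the example's gradient and a held-out set's mean gradient, ignoring the Hessian term of the classical influence function), then drops the 40% of examples with the smallest estimated influence. The synthetic data, the logistic-regression proxy model, and the pruning budget are all assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (placeholder for a real dataset).
n_train, n_val, d = 1000, 200, 20
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = (X_train @ w_true + 0.5 * rng.normal(size=n_train) > 0).astype(float)
X_val = rng.normal(size=(n_val, d))
y_val = (X_val @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a logistic-regression proxy model with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    grad = X_train.T @ (sigmoid(X_train @ w) - y_train) / n_train
    w -= 0.5 * grad

# Per-sample training gradients and the mean held-out gradient at the fitted parameters.
per_sample_grads = (sigmoid(X_train @ w) - y_train)[:, None] * X_train   # shape (n_train, d)
val_grad = X_val.T @ (sigmoid(X_val @ w) - y_val) / n_val                # shape (d,)

# First-order influence proxy: examples whose gradients barely align with the
# held-out gradient are estimated to have little effect on generalization.
influence = np.abs(per_sample_grads @ val_grad)

# Prune the 40% of training examples with the smallest estimated influence.
keep_frac = 0.6
keep_idx = np.argsort(influence)[-int(keep_frac * n_train):]
print(f"kept {len(keep_idx)} of {n_train} training examples")

Unlike this heuristic ranking, the paper formulates subset selection as an optimization problem whose solution comes with a strictly constrained generalization gap; the sketch only conveys the flavor of influence-based scoring.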
Related papers
- Optimize Cardinality Estimation Model Pretraining by Simplifying the Training Datasets [0.0]
We introduce a simplified training dataset, which has been reduced to a fraction of the size of existing pretraining datasets.
Experimental results demonstrate that the cardinality estimator pre-trained on this simplified dataset can still achieve performance comparable to existing models in zero-shot setups.
arXiv Detail & Related papers (2025-02-20T08:06:16Z)
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies that deprioritize redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed Temporal Dual-Depth Scoring (TDDS).
Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z)
- Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
This study investigates the impact of pre-training followed by fine-tuning on the fine-tuning process.
We conducted a thorough examination of 150 classification datasets.
We find that pre-training can only help improve the optimization process for models that fit the data poorly.
Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume.
arXiv Detail & Related papers (2023-09-11T06:26:57Z)
- An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration [11.102950630209879]
In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy.
We examined how pre-trained model size, pre-training dataset size, and training strategies impact generalization and uncertainty calibration.
arXiv Detail & Related papers (2023-07-17T01:27:10Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.