Dataset Pruning: Reducing Training Data by Examining Generalization Influence
- URL: http://arxiv.org/abs/2205.09329v1
- Date: Thu, 19 May 2022 05:36:35 GMT
- Title: Dataset Pruning: Reducing Training Data by Examining Generalization Influence
- Authors: Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, Ping Li
- Abstract summary: Do all training data contribute to the model's performance? How can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance?
- Score: 30.30255670341501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The great success of deep learning heavily relies on increasingly large training datasets, which come at the price of huge computational and infrastructural costs. This raises crucial questions: do all training data contribute to the model's performance? How much does each individual training sample, or a sub-training-set, affect the model's generalization, and how can we construct the smallest subset of the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these questions, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% decrease in test accuracy, which is superior to previous score-based sample selection methods.
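A minimal sketch of the influence idea is given below. It is not the authors' method: instead of the paper's constrained optimization with its theoretical guarantee, it scores each training example with a first-order influence proxy (alignment between the example's gradient and a held-out set's mean gradient, ignoring the Hessian term of the classical influence function), then drops the 40% of examples with the smallest estimated influence. The synthetic data, the logistic-regression proxy model, and the pruning budget are all assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary-classification data (placeholder for a real dataset).
n_train, n_val, d = 1000, 200, 20
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = (X_train @ w_true + 0.5 * rng.normal(size=n_train) > 0).astype(float)
X_val = rng.normal(size=(n_val, d))
y_val = (X_val @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a logistic-regression proxy model with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    grad = X_train.T @ (sigmoid(X_train @ w) - y_train) / n_train
    w -= 0.5 * grad

# Per-sample training gradients and the mean held-out gradient at the fitted parameters.
per_sample_grads = (sigmoid(X_train @ w) - y_train)[:, None] * X_train   # shape (n_train, d)
val_grad = X_val.T @ (sigmoid(X_val @ w) - y_val) / n_val                # shape (d,)

# First-order influence proxy: examples whose gradients barely align with the
# held-out gradient are estimated to have little effect on generalization.
influence = np.abs(per_sample_grads @ val_grad)

# Prune the 40% of training examples with the smallest estimated influence.
keep_frac = 0.6
keep_idx = np.argsort(influence)[-int(keep_frac * n_train):]
print(f"kept {len(keep_idx)} of {n_train} training examples")

Unlike this heuristic ranking, the paper formulates subset selection as an optimization problem whose solution comes with a strictly constrained generalization gap; the sketch only conveys the flavor of influence-based scoring.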
Related papers
- Optimize Cardinality Estimation Model Pretraining by Simplifying the Training Datasets [0.0]
We introduce a simplified training dataset, which has been reduced to a fraction of the size of existing pretraining datasets.
Experimental results demonstrate that the cardinality estimator pre-trained on this simplified dataset can still achieve performance comparable to existing models in zero-shot setups.
arXiv Detail & Related papers (2025-02-20T08:06:16Z)
- Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance.
We introduce novel algorithms for dynamic, instance-level data reweighting.
Our framework allows us to devise reweighting strategies that deprioritize redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning [50.809769498312434]
We propose a novel dataset pruning method termed Temporal Dual-Depth Scoring (TDDS).
Our method achieves 54.51% accuracy with only 10% training data, surpassing random selection by 7.83% and other comparison methods by at least 12.69%.
arXiv Detail & Related papers (2023-11-22T03:45:30Z)
- Examining the Effect of Pre-training on Time Series Classification [21.38211396933795]
This study investigates the impact of pre-training followed by fine-tuning on the fine-tuning process.
We conducted a thorough examination of 150 classification datasets.
We find that pre-training can only help improve the optimization process for models that fit the data poorly.
Adding more pre-training data does not improve generalization, but it can strengthen the advantage of pre-training on the original data volume.
arXiv Detail & Related papers (2023-09-11T06:26:57Z)
- An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration [11.102950630209879]
In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy.
We examined how pre-trained model size, pre-training dataset size, and training strategies impact generalization and uncertainty calibration.
arXiv Detail & Related papers (2023-07-17T01:27:10Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.