DataRater: Meta-Learned Dataset Curation
- URL: http://arxiv.org/abs/2505.17895v1
- Date: Fri, 23 May 2025 13:43:14 GMT
- Title: DataRater: Meta-Learned Dataset Curation
- Authors: Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa M. Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeffrey Dean, Hado van Hasselt, David Silver
- Abstract summary: We propose \emph{DataRater}, which estimates the value of training on any particular data point. This is done by meta-learning using `meta-gradients', with the objective of improving training efficiency on held-out data. In extensive experiments across a range of model scales and datasets, we find that using our DataRater to filter data is highly effective.
- Score: 40.90328309013541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quality of foundation models depends heavily on their training data. Consequently, great efforts have been put into dataset curation. Yet most approaches rely on manual tuning of coarse-grained mixtures of large buckets of data, or filtering by hand-crafted heuristics. An approach that is ultimately more scalable (let alone more satisfying) is to \emph{learn} which data is actually valuable for training. This type of meta-learning could allow more sophisticated, fine-grained, and effective curation. Our proposed \emph{DataRater} is an instance of this idea. It estimates the value of training on any particular data point. This is done by meta-learning using `meta-gradients', with the objective of improving training efficiency on held out data. In extensive experiments across a range of model scales and datasets, we find that using our DataRater to filter data is highly effective, resulting in significantly improved compute efficiency.
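For intuition, here is a minimal sketch of the meta-gradient idea in JAX, not the authors' implementation: a toy linear model takes one inner SGD step on a rater-weighted training loss, and the rater's parameters receive a gradient through that update from the held-out loss. The linear rater, the toy data, and all names below are illustrative assumptions; the actual DataRater uses a learned rater network at foundation-model scale.

```python
# Minimal sketch of meta-gradient data valuation (assumption: toy linear regression).
import jax
import jax.numpy as jnp

def model_loss(params, x, y):
    # Per-example squared error for a linear model: pred = x @ w + b.
    pred = x @ params["w"] + params["b"]
    return (pred - y) ** 2  # shape: (batch,)

def inner_update(params, rater, x, y, lr=0.1):
    # One SGD step on the rater-weighted training loss.
    def weighted_loss(p):
        weights = jax.nn.sigmoid(x @ rater)  # per-example rating in (0, 1)
        return jnp.mean(weights * model_loss(p, x, y))
    grads = jax.grad(weighted_loss)(params)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

def meta_objective(rater, params, x_train, y_train, x_val, y_val):
    # Held-out loss after one inner step of rater-weighted training.
    new_params = inner_update(params, rater, x_train, y_train)
    return jnp.mean(model_loss(new_params, x_val, y_val))

# Meta-gradient: differentiate the held-out loss through the inner update
# with respect to the rater's parameters.
meta_grad_fn = jax.grad(meta_objective, argnums=0)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
x_train = jax.random.normal(k1, (32, 4))
y_train = x_train @ jnp.ones(4)
x_val = jax.random.normal(k2, (16, 4))
y_val = x_val @ jnp.ones(4)

params = {"w": jnp.zeros(4), "b": jnp.zeros(())}
rater = jax.random.normal(k3, (4,)) * 0.01
meta_g = meta_grad_fn(rater, params, x_train, y_train, x_val, y_val)
rater = rater - 0.01 * meta_g  # one meta-update; filtering would threshold sigmoid(x @ rater)
```

In this sketch the rater is updated by a single meta-gradient step; in practice one would alternate many inner model updates with outer rater updates and then use the learned ratings to filter or down-weight data.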
Related papers
- Info-Coevolution: An Efficient Framework for Data Model Coevolution [11.754869657967207]
We propose a novel framework that enables models and data to coevolve through online selective annotation with no bias. For real-world datasets like ImageNet-1K, Info-Coevolution reduces annotation and training costs by 32% without performance loss.
arXiv Detail & Related papers (2025-06-09T17:04:11Z) - Progressive Data Dropout: An Embarrassingly Simple Approach to Faster Training [26.65053392031144]
We propose a series of alternative training paradigms that leverage insights from hard-data-mining and dropout. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. Surprisingly, the proposed method improves accuracy by up to 4.82%.
arXiv Detail & Related papers (2025-05-28T13:26:52Z) - Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning [40.19639581728674]
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. We propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. On the Llama-3-8B-Instruct model, Data Whisperer achieves superior performance to fine-tuning on the full GSM8K dataset while using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4$\times$ speedup.
arXiv Detail & Related papers (2025-05-18T03:10:00Z) - How to Achieve Higher Accuracy with Less Training Points? [2.1834099301440526]
We propose a technique based on influence functions to determine which training samples should be included in the training set. Our approach demonstrates performance comparable to that of training on the entire dataset while using only 10% of the data.
arXiv Detail & Related papers (2025-04-18T09:38:26Z) - Dataset Growth [59.68869191071907]
InfoGrowth is an efficient online algorithm for data cleaning and selection.
It can improve data quality/efficiency on both single-modal and multi-modal tasks.
arXiv Detail & Related papers (2024-05-28T16:43:57Z) - When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale [12.94829977468838]
Large volumes of text data have contributed significantly to the development of large language models.
To date, efforts to prune datasets down to a higher quality subset have relied on hand-crafted heuristics encoded as rule-based filters.
We take a wider view and explore scalable estimates of data quality that can be used to measure the quality of pretraining data.
arXiv Detail & Related papers (2023-09-08T19:34:05Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z) - Dataset Distillation by Matching Training Trajectories [75.9031209877651]
We propose a new formulation that optimizes our distilled data to guide networks to a similar state as those trained on real data.
Given a network, we train it for several iterations on our distilled data and optimize the distilled data with respect to the distance between the synthetically trained parameters and the parameters trained on real data.
Our method handily outperforms existing methods and also allows us to distill higher-resolution visual data.
arXiv Detail & Related papers (2022-03-22T17:58:59Z) - Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations [76.82124752950148]
We develop a convenient gradient-based method for selecting the data augmentation.
We use a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective.
arXiv Detail & Related papers (2022-02-22T02:51:11Z) - Gradient-guided Loss Masking for Neural Machine Translation [27.609155878513334]
In this paper, we explore strategies that dynamically optimize data usage during the training process.
Our algorithm calculates the gradient alignment between the training data and the clean data to mask out data with negative alignment (a minimal sketch of this idea appears after this list).
Experiments on three WMT language pairs show that our method brings significant improvement over strong baselines.
arXiv Detail & Related papers (2021-02-26T15:41:48Z) - AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z)
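For the Gradient-guided Loss Masking entry above, a minimal sketch of the gradient-alignment idea follows, under the same toy linear-model assumptions as the earlier sketch; the cited paper applies it to neural machine translation, and all names here are illustrative.

```python
# Minimal sketch of gradient-alignment loss masking (assumption: toy linear model).
import jax
import jax.numpy as jnp

def example_loss(params, x, y):
    # Squared error for a single example under a linear model.
    pred = x @ params["w"] + params["b"]
    return (pred - y) ** 2

def alignment_mask(params, x_train, y_train, x_clean, y_clean):
    # Gradient of the mean loss on a small, trusted "clean" set.
    clean_grad = jax.grad(
        lambda p: jnp.mean(jax.vmap(lambda xi, yi: example_loss(p, xi, yi))(x_clean, y_clean))
    )(params)
    # Per-example gradients on the (possibly noisy) training batch.
    per_ex_grads = jax.vmap(jax.grad(example_loss), in_axes=(None, 0, 0))(params, x_train, y_train)
    # Dot product between each training-example gradient and the clean gradient.
    dots = sum(
        jnp.sum(per_ex_grads[k].reshape(x_train.shape[0], -1) * clean_grad[k].reshape(1, -1), axis=-1)
        for k in params
    )
    # Keep examples whose gradients point in the same direction as the clean gradient.
    return (dots > 0).astype(jnp.float32)

if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    k1, k2 = jax.random.split(key)
    params = {"w": jnp.zeros(4), "b": jnp.zeros(())}
    x_train = jax.random.normal(k1, (32, 4))
    y_train = x_train @ jnp.ones(4)
    x_clean = jax.random.normal(k2, (8, 4))
    y_clean = x_clean @ jnp.ones(4)
    mask = alignment_mask(params, x_train, y_train, x_clean, y_clean)
    print(mask)  # 1.0 keeps an example's loss, 0.0 masks it out
```

In training, the mask would multiply the per-example losses each step, so examples whose gradients conflict with the clean-data direction do not contribute to the update.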