Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
- URL: http://arxiv.org/abs/2512.24503v1
- Date: Tue, 30 Dec 2025 23:02:44 GMT
- Title: Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
- Authors: Jiachen T. Wang, Tong Wu, Kaifeng Lyu, James Zou, Dawn Song, Ruoxi Jia, Prateek Mittal,
- Abstract summary: We show that the experiment conclusions about data quality can flip with even minor adjustments to training hyper parameters.<n>We introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy model training.<n> Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation.
- Score: 109.9635246405237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training runs. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to full-scale model training. In this work, we uncover a subtle yet critical issue in the standard experimental protocol for data recipe assessment: the use of identical small-scale model training configurations across all data recipes in the name of "fair" comparison. We show that the experiment conclusions about data quality can flip with even minor adjustments to training hyperparameters, as the optimal training configuration is inherently data-dependent. Moreover, this fixed-configuration protocol diverges from full-scale model development pipelines, where hyperparameter optimization is a standard step. Consequently, we posit that the objective of data recipe assessment should be to identify the recipe that yields the best performance under data-specific tuning. To mitigate the high cost of hyperparameter tuning, we introduce a simple patch to the evaluation protocol: using reduced learning rates for proxy model training. We show that this approach yields relative performance that strongly correlates with that of fully tuned large-scale LLM pretraining runs. Theoretically, we prove that for random-feature models, this approach preserves the ordering of datasets according to their optimal achievable loss. Empirically, we validate this approach across 23 data recipes covering four critical dimensions of data curation, demonstrating dramatic improvements in the reliability of small-scale experiments.
Related papers
- Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols [123.73663884421272]
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms.<n>We establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets.<n>By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.
arXiv Detail & Related papers (2026-02-28T05:41:57Z) - Optimizing the Training Diet: Data Mixture Search for Robust Time Series Forecasting [0.8665758002017515]
We show that, in some cases, "less is more" when considering datasets.<n>We introduce a framework for discovering the optimal "training diet" from a large, unlabeled time series corpus.
arXiv Detail & Related papers (2025-12-12T13:26:07Z) - What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z) - Beware of Calibration Data for Pruning Large Language Models [41.1689082093302]
Post-training pruning is a promising method that does not require resource-intensive iterative training.<n> calibration data is also crucial to post-training pruning, especially for high sparsity.<n>We provide a self-generating calibration data synthesis strategy to construct feasible calibration data.
arXiv Detail & Related papers (2024-10-23T09:36:21Z) - Exploring Learning Complexity for Efficient Downstream Dataset Pruning [8.990878450631596]
Existing dataset pruning methods require training on the entire dataset.<n>We propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC)<n>Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters.
arXiv Detail & Related papers (2024-02-08T02:29:33Z) - Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is to handle non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z) - Online Importance Sampling for Stochastic Gradient Optimization [33.42221341526944]
We propose a practical algorithm that efficiently computes data importance on-the-fly during training.<n>We also introduce a novel metric based on the derivative of the loss w.r.t. the network output, designed for mini-batch importance sampling.
arXiv Detail & Related papers (2023-11-24T13:21:35Z) - Dataset Pruning: Reducing Training Data by Examining Generalization
Influence [30.30255670341501]
Do all training data contribute to model's performance?
How to construct a smallest subset from the entire training data as a proxy training set without significantly sacrificing the model's performance?
arXiv Detail & Related papers (2022-05-19T05:36:35Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.