DataDecide: How to Predict Best Pretraining Data with Small Experiments
- URL: http://arxiv.org/abs/2504.11393v1
- Date: Tue, 15 Apr 2025 17:02:15 GMT
- Title: DataDecide: How to Predict Best Pretraining Data with Small Experiments
- Authors: Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
- Abstract summary: We release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds.
- Score: 67.95896457895404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide -- the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
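The decision procedure described in the abstract is easy to reproduce in spirit: rank data recipes by benchmark score at a small scale and check how often those pairwise rankings agree with the ranking at the target scale. Below is a minimal sketch of that pairwise decision accuracy; the function and the example scores are illustrative assumptions, not part of the released DataDecide code.

```python
from itertools import combinations

def decision_accuracy(small_scores, large_scores):
    """Fraction of dataset pairs where the small-scale ranking agrees
    with the ranking at the larger target scale.

    small_scores / large_scores: dict mapping corpus name -> benchmark
    score of a model trained on that corpus at the given scale.
    (Illustrative structure; not the DataDecide API.)
    """
    datasets = sorted(set(small_scores) & set(large_scores))
    correct, total = 0, 0
    for a, b in combinations(datasets, 2):
        small_pref = small_scores[a] > small_scores[b]
        large_pref = large_scores[a] > large_scores[b]
        correct += (small_pref == large_pref)
        total += 1
    return correct / total if total else float("nan")

# Hypothetical example: scores for 3 corpora at 150M vs 1B parameters.
small = {"corpus_a": 0.41, "corpus_b": 0.38, "corpus_c": 0.45}
large = {"corpus_a": 0.55, "corpus_b": 0.49, "corpus_c": 0.58}
print(decision_accuracy(small, large))  # 1.0 -> rankings fully agree
```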
Related papers
- Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection [37.65064631532493]
Finetuning a pretrained model to perform unsupervised prediction on data from a target domain presents two challenges. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data into the finetuning data mixture prevents the model from forgetting the pretraining set.
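As a rough illustration of that 1% injection recipe (a sketch under assumed data structures, not the paper's implementation), one could mix a small fraction of pretraining examples into the finetuning set like this:

```python
import random

def mix_finetune_with_pretrain(finetune_examples, pretrain_examples,
                               pretrain_fraction=0.01, seed=0):
    """Return a shuffled training set in which roughly `pretrain_fraction`
    of the finetuning-set size comes from the pretraining corpus.
    (Illustrative helper, not from the paper's code.)"""
    rng = random.Random(seed)
    n_inject = max(1, int(len(finetune_examples) * pretrain_fraction))
    injected = rng.sample(pretrain_examples,
                          k=min(n_inject, len(pretrain_examples)))
    mixture = list(finetune_examples) + injected
    rng.shuffle(mixture)
    return mixture

# Hypothetical usage with toy string examples:
mixed = mix_finetune_with_pretrain(["ft_%d" % i for i in range(1000)],
                                   ["pt_%d" % i for i in range(100)])
print(sum(x.startswith("pt_") for x in mixed), "of", len(mixed), "examples injected")
```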
arXiv Detail & Related papers (2025-02-09T21:44:27Z)
- Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z)
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
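A typical scaling-law estimate of this kind fits a saturating power law to losses from small models and extrapolates to a larger target. The sketch below uses synthetic numbers and a generic functional form, loss(N) = E + a·N^(-alpha); it is an assumption-laden illustration, not the fitting procedure from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Generic saturating power law: loss(N) = E + a * N**(-alpha),
# with N the parameter count in millions (synthetic example).
def scaling_law(N, E, a, alpha):
    return E + a * N ** (-alpha)

# Made-up losses for small models (generated from E=2.0, a=8.0, alpha=0.5).
N_small = np.array([20.0, 60.0, 150.0, 300.0])        # millions of params
loss_small = np.array([3.789, 3.033, 2.653, 2.462])   # synthetic values

(E, a, alpha), _ = curve_fit(scaling_law, N_small, loss_small, p0=[1.0, 5.0, 0.4])
print(f"fit: E={E:.2f}, a={a:.2f}, alpha={alpha:.2f}")
print(f"extrapolated loss at 1B params: {scaling_law(1000.0, E, a, alpha):.3f}")
```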
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- Does your data spark joy? Performance gains from domain upsampling at the end of training [16.572129046599937]
It is expensive to understand the impact of domain-specific datasets on training at large FLOP scales.
We use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks.
This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.
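One simple way to picture end-of-training domain upsampling (a toy schedule with made-up domain names and factors, not the paper's training code) is to boost the sampling weight of the domain of interest only in the final fraction of steps:

```python
def domain_sampling_weights(step, total_steps, base_weights,
                            upsampled_domain, upsample_factor=5.0,
                            final_frac=0.1):
    """Return normalized per-domain sampling weights for this step.

    During the last `final_frac` of training, the weight of
    `upsampled_domain` is multiplied by `upsample_factor`; otherwise the
    base mixture is used unchanged. (Illustrative sketch only.)
    """
    weights = dict(base_weights)
    if step >= (1.0 - final_frac) * total_steps:
        weights[upsampled_domain] *= upsample_factor
    total = sum(weights.values())
    return {domain: w / total for domain, w in weights.items()}

# Hypothetical mixture of web, code, and math data over 10,000 steps.
base = {"web": 0.8, "code": 0.1, "math": 0.1}
print(domain_sampling_weights(5000, 10000, base, "math"))  # base mixture
print(domain_sampling_weights(9500, 10000, base, "math"))  # math upsampled
```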
arXiv Detail & Related papers (2024-06-05T17:29:15Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
For instance, scaling laws mostly predict language modeling loss, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data [4.076366901873452]
Practitioners building classifiers often start with a smaller pilot dataset and plan to grow to larger data in the near future.
Such projects need a toolkit for extrapolating how much accuracy may improve from a 2x, 10x, or 50x increase in data size.
We propose a process model to obtain probabilistic extrapolations of accuracy or similar performance metrics as dataset size increases.
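A lightweight way to get uncertainty around such extrapolations (not the paper's probabilistic process model; just a bootstrap sketch over a saturating power-law fit, with made-up pilot data) is shown below:

```python
import numpy as np
from scipy.optimize import curve_fit

# Saturating learning curve: accuracy(n) = c - a * n**(-b)
def learning_curve(n, c, a, b):
    return c - a * n ** (-b)

rng = np.random.default_rng(0)
n_pilot = np.array([200, 500, 1000, 2000, 4000])        # pilot set sizes
acc_pilot = np.array([0.62, 0.68, 0.72, 0.75, 0.78])    # made-up accuracies

# Bootstrap the fit to get a rough interval for accuracy at 10x more data.
target_n = 40000
preds = []
for _ in range(200):
    idx = rng.integers(0, len(n_pilot), size=len(n_pilot))
    try:
        params, _ = curve_fit(learning_curve, n_pilot[idx], acc_pilot[idx],
                              p0=[0.85, 2.0, 0.4], maxfev=10000)
        preds.append(learning_curve(target_n, *params))
    except RuntimeError:
        continue  # skip resamples where the fit does not converge
print("median prediction:", np.median(preds))
print("80% interval:", np.percentile(preds, [10, 90]))
```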
arXiv Detail & Related papers (2023-11-29T19:10:15Z)
- nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales [65.01417261415833]
We present an approach to predict the pre-training loss based on our observations that Maximal Update Parametrization (muP) enables accurate fitting of scaling laws.
With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.
Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models.
arXiv Detail & Related papers (2023-04-14T00:45:01Z)
- On Anytime Learning at Macroscale [33.674452784463774]
In many practical applications, data does not arrive all at once, but in batches over time.
A greedy predictor could produce non-trivial predictions by training on each batch as soon as it becomes available, but it may also make sub-optimal use of future data.
A tardy predictor could wait a long time to aggregate several batches into a larger dataset, but ultimately deliver much better performance.
arXiv Detail & Related papers (2021-06-17T14:45:22Z)