Pretraining Scaling Laws for Generative Evaluations of Language Models
- URL: http://arxiv.org/abs/2509.24012v1
- Date: Sun, 28 Sep 2025 18:04:18 GMT
- Title: Pretraining Scaling Laws for Generative Evaluations of Language Models
- Authors: Rylan Schaeffer, Noam Levi, Brando Miranda, Sanmi Koyejo,
- Abstract summary: We show three different scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model.<n>Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance.
- Score: 30.6654523997984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural scaling laws have played a central role in modern machine learning, driving the field's ever-expanding scaling of parameters, data and compute. While much research has gone into fitting scaling laws and predicting performance on pretraining losses and on discriminative evaluations such as multiple-choice question-answering, comparatively little research has been done on fitting scaling laws and predicting performance on generative evaluations such as mathematical problem-solving or software engineering. We propose and evaluate three different pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model using the performance of cheaper models. Our three scaling laws differ in the covariates used: (1) compute, (2) model parameters and tokens, (3) log likelihoods of gold reference solutions. We make four main contributions: (1) We show how generative evaluations offer new hyperparameters (in our setting, $k$) that researchers can use to control the scaling laws parameters and the predictability of performance. (2) In terms of scaling law parameters, we find that the compute scaling law and parameters\,+\,tokens scaling law stabilize for the last ~$1.5{-}2.5$ orders of magnitude, whereas the gold reference likelihood scaling law stabilizes for the last ~$5$ orders of magnitude. (3) In terms of predictive performance, we find all three scaling laws perform comparably, although the compute scaling law predicts slightly worse for small $k$ and the log likelihoods of gold reference solutions predicts slightly worse for large $k$. (4) We establish a theoretical connection that the compute scaling law emerges as the compute-optimal envelope of the parameters-and-tokens scaling law. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance.
Related papers
- Towards Robust Scaling Laws for Optimizers [89.21160945066737]
Empirical scaling laws are widely used to predict loss as model size and training data grow.<n>We show that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
arXiv Detail & Related papers (2026-02-07T21:40:33Z) - Relative Scaling Laws for LLMs [91.73497548097775]
Scaling laws describe how language models improve with additional data, parameters, and compute.<n>We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale.<n>These results show that although scaling improves overall performance, it is not a universal equalizer.
arXiv Detail & Related papers (2025-10-28T16:55:22Z) - Bayesian Neural Scaling Law Extrapolation with Prior-Data Fitted Networks [100.13335639780415]
Scaling laws often follow the power-law and proposed several variants of power-law functions to predict the scaling behavior at larger scales.<n>Existing methods mostly rely on point estimation and do not quantify uncertainty, which is crucial for real-world applications.<n>In this work, we explore a Bayesian framework based on Prior-data Fitted Networks (PFNs) for neural scaling law extrapolation.
arXiv Detail & Related papers (2025-05-29T03:19:17Z) - Scaling Law Phenomena Across Regression Paradigms: Multiple and Kernel Approaches [28.569601803576845]
We show that for models with Transformer architecture, the test loss exhibits a power-law relationship with model size, dataset size, and the amount of computation used in training.<n>Our analysis provides deeper insights into the scaling law, potentially enhancing our understanding of Large Language Models.
arXiv Detail & Related papers (2025-03-03T08:57:49Z) - ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model [27.532993606576152]
We introduce a scalable motion generation framework that includes the motion tokenizer MotionQ-VAE and a text FS-VAE transformer.<n>For the first time, we confirm the existence of scaling laws within the context of motion generation.<n>We predict the optimal transformer size, vocabulary size, and data requirements for a compute budget of $1e18$.
arXiv Detail & Related papers (2024-12-19T06:22:19Z) - Resolving Discrepancies in Compute-Optimal Scaling of Language Models [42.82944266028316]
We explain the discrepancy by reproducing the Kaplan scaling law on two datasets.<n>We find that careful learning rate decay is not essential for the validity of their scaling law.
arXiv Detail & Related papers (2024-06-27T13:02:43Z) - Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
In contrast, scaling laws mostly predict loss on inference, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z) - Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase.
We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts.
We identify a kind of accelerated emergence whose scaling curve cannot be fitted by standard scaling law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z) - A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z) - Scaling Laws Under the Microscope: Predicting Transformer Performance
from Small Scale Experiments [42.793379799720434]
We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
arXiv Detail & Related papers (2022-02-13T19:13:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.