Resolving Discrepancies in Compute-Optimal Scaling of Language Models
- URL: http://arxiv.org/abs/2406.19146v4
- Date: Sun, 19 Jan 2025 10:34:08 GMT
- Title: Resolving Discrepancies in Compute-Optimal Scaling of Language Models
- Authors: Tomer Porian, Mitchell Wortsman, Jenia Jitsev, Ludwig Schmidt, Yair Carmon
- Abstract summary: We explain the discrepancy by reproducing the Kaplan scaling law on two datasets. We find that careful learning rate decay is not essential for the validity of their scaling law.
- Score: 42.82944266028316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
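
To make the size of the discrepancy described in the abstract concrete, here is a minimal sketch that extrapolates the optimal model size under two power laws of the form N*(C) = n0 * (C / c0)^a. The exponents (about 0.73 for Kaplan et al. and about 0.5 for Hoffmann et al.) are the commonly cited values and do not appear in this listing. The anchor point `c0`, `n0`, the function names, and the C ≈ 6·N·D cost approximation are assumptions chosen for illustration. They are not fitted values from the paper, and the sketch ignores the last-layer cost correction that the paper discusses.

```python
# Illustrative sketch only: the anchor (c0, n0) is hypothetical, not a fitted value.
FLOPS_PER_PARAM_TOKEN = 6.0  # assumed approximation: C ~ 6 * N * D

KAPLAN_EXPONENT = 0.73    # commonly cited exponent for Kaplan et al.
HOFFMANN_EXPONENT = 0.50  # commonly cited exponent for Hoffmann et al. ("Chinchilla")


def optimal_params(compute, exponent, c0=1e18, n0=1e8):
    """Power law N*(C) = n0 * (C / c0) ** exponent, anchored at (c0, n0)."""
    return n0 * (compute / c0) ** exponent


def tokens_for(compute, params):
    """Tokens D implied by a compute budget and model size under C ~ 6 * N * D."""
    return compute / (FLOPS_PER_PARAM_TOKEN * params)


# Both laws agree at the anchor by construction and diverge as compute grows.
for compute in (1e18, 1e20, 1e22):
    for name, exponent in (("Kaplan", KAPLAN_EXPONENT), ("Hoffmann", HOFFMANN_EXPONENT)):
        n = optimal_params(compute, exponent)
        d = tokens_for(compute, n)
        print(f"C={compute:.0e} {name:9s} N*={n:.3e} D={d:.3e} tokens/param={d / n:.1f}")
```

Because the two exponents differ, predictions that coincide at the anchor drift apart by orders of magnitude in model size over a few decades of compute, which is the kind of gap the paper sets out to explain.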
Related papers
- Towards Robust Scaling Laws for Optimizers [89.21160945066737]
Empirical scaling laws are widely used to predict loss as model size and training data grow.
We show that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
arXiv Detail & Related papers (2026-02-07T21:40:33Z) - Effective Frontiers: A Unification of Neural Scaling Laws [19.808117554175013]
We propose a unified framework that abstracts general learning tasks as the progressive coverage of patterns from a long-tail (Zipfian) distribution.
We derive the precise scaling laws for $N$, $D$, and $C$, attributing them to capacity, coverage, and optimization bottlenecks.
arXiv Detail & Related papers (2026-02-01T10:44:46Z) - Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales [55.91454326946738]
We study how the optimal learning rate and weight decay should scale with model width and depth for a wide range of languages.
We find that scaling the learning rate according to $\mu$P improves transfer, but can still suffer from significant finite-width deviations.
For compute-optimal scaling, we find scaling independent weight decay as $1/\mathrm{width}$ is nearly optimal across languages.
arXiv Detail & Related papers (2025-12-05T11:03:41Z) - Pretraining Scaling Laws for Generative Evaluations of Language Models [30.6654523997984]
We show three different scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model.
Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance.
arXiv Detail & Related papers (2025-09-28T18:04:18Z) - Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law [4.6193503399184275]
Recent works have highlighted difficulties faced by gradient descent in training the first and last layers of transformer-based language models.
These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data.
We show that the problem is more difficult when the data have heavier tails.
arXiv Detail & Related papers (2025-05-25T16:43:51Z) - Compute-Optimal LLMs Provably Generalize Better With Scale [102.29926217670926]
We develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime.
We introduce a novel, fully empirical Freedman-type martingale concentration that tightens existing bounds by accounting for the variance of the loss function.
We produce a scaling law for the generalization gap, with bounds that become predictably stronger with scale.
arXiv Detail & Related papers (2025-04-21T16:26:56Z) - Do Larger Language Models Generalize Better? A Scaling Law for Implicit Reasoning at Pretraining Time [73.22651918134808]
This work shows counterintuitive effects of model size scaling and provides new insights into the relationship between scaling and reasoning in language models (LMs).
We pretrain LMs from scratch on a synthetic implicit multihop reasoning environment designed to replicate the structure and distribution of real-world large-scale knowledge graphs.
We then assess the LMs' ability to complete the missing edges in the graph, which requires multi-hop reasoning that can be viewed as a simplification of implicit reasoning during real-world pretraining.
arXiv Detail & Related papers (2025-04-04T17:57:22Z) - Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [46.959380978972206]
We study inference scaling laws (aka test-time scaling laws) and compute-optimal inference.
As a first step towards understanding and designing compute-optimal inference methods, we studied cost-performance trade-offs for inference strategies.
Our findings suggest that scaling inference compute with inference strategies can be more computationally efficient than scaling model parameters.
arXiv Detail & Related papers (2024-08-01T17:16:04Z) - gzip Predicts Data-dependent Scaling Laws [2.5461535398221478]
We generate training datasets of varying complexities by modulating the syntactic properties of a PCFG.
We propose a new data-dependent scaling law for LMs that accounts for the training data's gzip-compressibility.
arXiv Detail & Related papers (2024-05-26T20:33:08Z) - Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling [27.058009599819012]
We study the connection between optimal learning rates and batch sizes for Adam-style optimizers.
We prove that the optimal learning rate first rises and then falls as the batch size increases.
arXiv Detail & Related papers (2024-05-23T13:52:36Z) - Selecting Large Language Model to Fine-tune via Rectified Scaling Law [74.84096546112215]
Given constrained resources, fine-tuning all models and making selections afterward is unrealistic.
We find that the fine-tuning scaling curve includes not just the well-known "power phase" but also the previously unobserved "pre-power phase".
By leveraging our law, we propose a novel LLM selection algorithm that selects the near-optimal model with hundreds of times less resource consumption.
arXiv Detail & Related papers (2024-02-04T01:55:00Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws [14.546425605156578]
We modify the Chinchilla scaling laws to calculate the optimal LLM parameter count and pre-training data size to train and deploy a model of a given quality and inference demand.
We train 47 models of varying sizes and parameter counts to validate our formula and find that model quality continues to improve as we scale tokens per parameter to extreme ranges.
arXiv Detail & Related papers (2023-12-31T10:53:58Z) - Scaling Laws Beyond Backpropagation [64.0476282000118]
We study the ability of Direct Feedback Alignment to train causal decoder-only Transformers efficiently.
We find that DFA fails to offer more efficient scaling than backpropagation.
arXiv Detail & Related papers (2022-10-26T10:09:14Z) - Understanding Scaling Laws for Recommendation Models [1.6283945233720964]
We study empirical scaling laws for DLRM-style recommendation models, in particular Click-Through Rate (CTR).
We characterize scaling efficiency along three different resource dimensions, namely data, parameters and compute.
We show that parameter scaling is out of steam for the model architecture under study, and until a higher-performing model architecture emerges, data scaling is the path forward.
arXiv Detail & Related papers (2022-08-17T19:13:17Z) - Scaling Laws for Neural Machine Translation [21.76567580425173]
We show that cross-entropy loss as a function of model size follows a certain scaling law.
We also investigate the relationship between the cross-entropy loss and the quality of the translations generated.
arXiv Detail & Related papers (2021-09-16T06:15:20Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-critical optimization that finds an $O(epsilon)$epsilon point in the optimal product.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.