Scaling Laws Under the Microscope: Predicting Transformer Performance
from Small Scale Experiments
- URL: http://arxiv.org/abs/2202.06387v1
- Date: Sun, 13 Feb 2022 19:13:00 GMT
- Title: Scaling Laws Under the Microscope: Predicting Transformer Performance
from Small Scale Experiments
- Authors: Maor Ivgi, Yair Carmon and Jonathan Berant
- Abstract summary: We investigate whether scaling laws can be used to accelerate model development.
We find that scaling laws emerge at finetuning time in some NLP tasks.
For tasks where scaling laws exist, they can be used to predict the performance of larger models.
- Score: 42.793379799720434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural scaling laws define a predictable relationship between a model's
parameter count and its performance after training in the form of a power law.
However, most research to date has not explicitly investigated whether scaling
laws can be used to accelerate model development. In this work, we perform such
an empirical investigation across a wide range of language understanding tasks,
starting from models with as few as 10K parameters, and evaluate downstream
performance across 9 language understanding tasks. We find that scaling laws
emerge at finetuning time in some NLP tasks, and that they can also be
exploited for debugging convergence when training large models. Moreover, for
tasks where scaling laws exist, they can be used to predict the performance of
larger models, which enables effective model selection. However, revealing
scaling laws requires careful hyperparameter tuning and multiple runs for the
purpose of uncertainty estimation, which incurs additional overhead, partially
offsetting the computational benefits.
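To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's exact procedure) of fitting a saturating power law err(N) = a·N^(-b) + c to downstream error measured on small models and extrapolating it to a larger candidate model; the data points, functional form, and initial guesses are hypothetical assumptions.

```python
# Illustrative sketch: fit a saturating power law to downstream error
# measured on small models, then extrapolate to a larger model.
# The (parameter count, error) pairs below are hypothetical placeholders,
# e.g. averages over several finetuning runs for uncertainty estimation.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, b, c):
    # err(N) = a * N^(-b) + c : error decays as a power of parameter
    # count and saturates at an irreducible floor c.
    return a * n_params ** (-b) + c

n_params = np.array([1e4, 1e5, 1e6, 1e7, 1e8])
dev_error = np.array([0.62, 0.48, 0.37, 0.29, 0.24])

(a, b, c), _ = curve_fit(
    power_law, n_params, dev_error,
    p0=(1.0, 0.1, 0.1), bounds=(0, np.inf),
)

# Predict the error of a larger model before training it.
target_n = 1e9
print(f"Fitted exponent b = {b:.3f}, "
      f"predicted error at {target_n:.0e} params: "
      f"{power_law(target_n, a, b, c):.3f}")
```

Under a fit of this kind, model selection amounts to comparing the extrapolated curves of candidate model families at the target parameter budget, and repeating the fit over multiple runs gives a rough sense of the prediction's uncertainty.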
Related papers
- A Simple Model of Inference Scaling Laws [1.3597551064547502]
We study scaling laws in the context of inference, specifically how performance improves with multiple inference attempts.
Our simple framework lays the groundwork for combining inference scaling with other known scaling laws (see the sketch after this list).
arXiv Detail & Related papers (2024-10-21T18:00:06Z)
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
- Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publicly available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z)
- Scaling Laws Behind Code Understanding Model [4.846512516189021]
We study the scaling law for the code understanding task by varying training data, model size, and computing resources.
We train a large-scale code understanding model named CoLSBERT with 1.5B parameters on a large dataset using more computing resources, which outperforms previous work by a large margin.
arXiv Detail & Related papers (2024-02-20T08:31:42Z)
- Mixtures of Experts Unlock Parameter Scaling for Deep RL [54.26191237981469]
In this paper, we demonstrate that incorporating Mixture-of-Experts (MoE) modules into value-based networks results in more parameter-scalable models.
This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
arXiv Detail & Related papers (2024-02-13T17:18:56Z)
- Predicting Emergent Abilities with Infinite Resolution Evaluation [85.89911520190711]
We introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase.
We predict the performance of the 2.4B model on code generation with merely 0.05% deviation before training starts.
We identify a kind of accelerated emergence whose scaling curve cannot be fitted by a standard scaling-law function.
arXiv Detail & Related papers (2023-10-05T02:35:00Z)
- A Solvable Model of Neural Scaling Laws [72.8349503901712]
Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws.
We propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology.
A key finding is the manner in which power laws occurring in the statistics of natural datasets are extended by nonlinear random feature maps.
arXiv Detail & Related papers (2022-10-30T15:13:18Z)
- Scaling Laws for Acoustic Models [7.906034575114518]
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships between performance and quantities such as model and training set size.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z)
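As a complement to the "A Simple Model of Inference Scaling Laws" entry above, here is a minimal sketch of how performance can improve with repeated inference attempts, under the simplifying assumption of independent attempts with a fixed per-attempt success probability p; this independence assumption is ours and is not necessarily the paper's exact model.

```python
# Minimal sketch of inference scaling under an independence assumption:
# with per-attempt success probability p, the chance of at least one
# success within k independent attempts is 1 - (1 - p)**k.
def coverage(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1.0 - (1.0 - p) ** k

if __name__ == "__main__":
    p = 0.05  # hypothetical single-attempt success rate
    for k in (1, 4, 16, 64, 256):
        print(f"k={k:4d}  coverage={coverage(p, k):.3f}")
```

Even a small per-attempt success rate compounds quickly with the number of attempts, which is the kind of behavior an inference scaling law aims to capture alongside the parameter-count scaling laws discussed above.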