Algorithmic progress in language models
- URL: http://arxiv.org/abs/2403.05812v1
- Date: Sat, 9 Mar 2024 06:26:21 GMT
- Title: Algorithmic progress in language models
- Authors: Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan
Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla
- Abstract summary: We investigate the rate at which algorithms for pre-training language models have improved since the advent of deep learning.
We use a dataset of over 200 language model evaluations on Wikitext and Penn Treebank spanning 2012-2023.
We find that the compute required to reach a set performance threshold has halved approximately every 8 months, with a 95% confidence interval of around 5 to 14 months.
- Score: 1.7402659488193557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the rate at which algorithms for pre-training language models
have improved since the advent of deep learning. Using a dataset of over 200
language model evaluations on Wikitext and Penn Treebank spanning 2012-2023, we
find that the compute required to reach a set performance threshold has halved
approximately every 8 months, with a 95% confidence interval of around 5 to 14
months, substantially faster than hardware gains per Moore's Law. We estimate
augmented scaling laws, which enable us to quantify algorithmic progress and
determine the relative contributions of scaling models versus innovations in
training algorithms. Despite the rapid pace of algorithmic progress and the
development of new architectures such as the transformer, our analysis reveals
that the increase in compute made an even larger contribution to overall
performance improvements over this time period. Though limited by noisy
benchmark data, our analysis quantifies the rapid progress in language
modeling, shedding light on the relative contributions from compute and
algorithms.
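As a rough illustration of what these halving times imply, the sketch below converts a constant halving time into the cumulative reduction in compute needed to reach a fixed performance threshold, and compares it against a Moore's-law-style doubling every two years. This is a back-of-the-envelope illustration only, not the paper's estimation procedure (which fits augmented scaling laws to benchmark data); the function name is an illustrative choice.

```python
def compute_reduction_factor(months_elapsed: float, halving_time_months: float) -> float:
    """Factor by which the compute needed to reach a fixed performance
    threshold shrinks after `months_elapsed`, assuming a constant halving time."""
    return 2.0 ** (months_elapsed / halving_time_months)

# Central estimate and 95% CI bounds reported in the abstract, over one decade.
for halving_months in (5, 8, 14):
    factor = compute_reduction_factor(10 * 12, halving_months)
    print(f"halving time {halving_months:>2} months -> ~{factor:,.0f}x less compute per decade")

# Hardware-only baseline in the spirit of Moore's law (doubling roughly every 24 months):
print(f"Moore's-law-style baseline -> ~{2 ** (10 * 12 / 24):.0f}x per decade")
```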
Related papers
- Recursive Inference Scaling: A Winning Path to Scalable Inference in Language and Multimodal Systems [21.01887711305712]
We introduce Recursive INference Scaling (RINS) as a complementary, plug-in recipe for scaling inference time.
For a given fixed model architecture and training compute budget, RINS substantially improves language modeling performance.
RINS delivers gains in multimodal systems, including a +2% improvement in 0-shot ImageNet accuracy for SigLIP-B/16.
arXiv Detail & Related papers (2025-02-11T12:11:40Z)
- When, Where and Why to Average Weights? [36.106114687828395]
Averaging checkpoints along the training trajectory is a powerful approach to improve the generalization performance of Machine Learning models.
We show that averaging significantly accelerates training and yields considerable efficiency gains, at the cost of only minimal implementation and memory overhead.
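For context, checkpoint averaging along a training trajectory is typically implemented as a uniform average of saved parameter tensors. The sketch below is a generic illustration under its own naming (average_checkpoints, placeholder checkpoint filenames), not the paper's implementation or its guidance on when and where to average.

```python
import torch

def average_checkpoints(paths):
    """Uniformly average the parameters of several saved checkpoints.

    Assumes each file stores a plain state_dict of tensors with matching keys;
    floating-point tensors are averaged, everything else is copied from the first.
    """
    n = len(paths)
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() if torch.is_floating_point(v) else v.clone()
                   for k, v in state.items()}
        else:
            for k, v in state.items():
                if torch.is_floating_point(v):
                    avg[k] += v.float()
    return {k: v / n if torch.is_floating_point(v) else v for k, v in avg.items()}

# Hypothetical usage (checkpoint filenames are placeholders):
# model.load_state_dict(average_checkpoints(["step_10k.pt", "step_20k.pt", "step_30k.pt"]))
```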
arXiv Detail & Related papers (2025-02-10T18:40:48Z)
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models [63.188607839223046]
This survey focuses on the benefits of scaling compute during inference.
We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation.
arXiv Detail & Related papers (2024-06-24T17:45:59Z)
- Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance.
Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
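As an illustration of structured (filter-level) pruning without retraining, the sketch below keeps the convolutional filters with the largest L1 norm and discards the rest. The L1 criterion and the function prune_conv_filters are illustrative choices, not necessarily the criterion used in the paper.

```python
import numpy as np

def prune_conv_filters(weight: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Structured pruning of a conv layer's weights, shaped
    (out_channels, in_channels, kH, kW): keep the filters with the largest
    L1 norm and drop the rest, with no retraining ("zero-shot")."""
    out_channels = weight.shape[0]
    n_keep = max(1, int(round(keep_fraction * out_channels)))
    scores = np.abs(weight).reshape(out_channels, -1).sum(axis=1)  # L1 norm per filter
    keep = np.sort(np.argsort(scores)[-n_keep:])                   # indices of filters to keep
    return weight[keep]

# e.g. keep the strongest 50% of filters in a hypothetical 64-filter layer:
w = np.random.randn(64, 32, 3, 3)
print(prune_conv_filters(w, 0.5).shape)   # (32, 32, 3, 3)
```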
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
- Algorithmic progress in computer vision [0.8547032097715571]
We investigate algorithmic progress in image classification on ImageNet.
We find that algorithmic improvements have been roughly as important as the scaling of compute for progress in computer vision.
Compute-augmenting algorithmic advances are made at a pace more than twice as fast as the rate usually associated with Moore's law.
arXiv Detail & Related papers (2022-12-10T00:18:05Z)
- Revisiting Neural Scaling Laws in Language and Vision [43.57394336742374]
We argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting parameters.
We present a recipe for estimating scaling law parameters reliably from learning curves.
We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains.
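A toy version of this kind of recipe: fit a saturating power law to the small-compute portion of a learning curve, then score the fit by its extrapolation error on held-out large-compute points rather than by in-sample goodness of fit. The functional form, synthetic data, and use of scipy.optimize.curve_fit below are illustrative choices, not the paper's estimator.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # A simple saturating power law: loss ~ a * x**(-b) + c.
    return a * np.power(x, -b) + c

# Toy "learning curve": loss vs. training compute (arbitrary units).
rng = np.random.default_rng(0)
compute = np.logspace(1, 5, 30)
loss = power_law(compute, 5.0, 0.3, 1.2) + rng.normal(0, 0.01, compute.size)

# Fit on the small-compute points only, then judge the fit by how well it
# extrapolates to the held-out large-compute points (extrapolation loss).
train, held_out = slice(0, 20), slice(20, None)
params, _ = curve_fit(power_law, compute[train], loss[train], p0=(1.0, 0.5, 0.5), maxfev=10000)
pred = power_law(compute[held_out], *params)
print("fitted (a, b, c):", np.round(params, 3))
print("extrapolation RMSE:", np.sqrt(np.mean((pred - loss[held_out]) ** 2)))
```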
arXiv Detail & Related papers (2022-09-13T09:41:51Z)
- Scalable computation of prediction intervals for neural networks via matrix sketching [79.44177623781043]
Existing algorithms for uncertainty estimation require modifying the model architecture and training procedure.
This work proposes a new algorithm that can be applied to a given trained neural network and produces approximate prediction intervals.
arXiv Detail & Related papers (2022-05-06T13:18:31Z)
- Evolving Reinforcement Learning Algorithms [186.62294652057062]
We propose a method for meta-learning reinforcement learning algorithms.
The learned algorithms are domain-agnostic and can generalize to new environments not seen during training.
We highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games.
arXiv Detail & Related papers (2021-01-08T18:55:07Z)
- Efficient Computation of Expectations under Spanning Tree Distributions [67.71280539312536]
We propose unified algorithms for the important cases of first-order expectations and second-order expectations in edge-factored, non-projective spanning-tree models.
Our algorithms exploit a fundamental connection between gradients and expectations, which allows us to derive efficient algorithms.
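The gradient-expectation connection referred to here is the standard exponential-family identity; written for an edge-factored spanning-tree distribution (in notation chosen here, with $\mu_e$ the marginal probability that edge $e$ appears in a tree):

```latex
p(t) \propto \exp\Big(\sum_{e \in t} \theta_e\Big),
\qquad
\mathbb{E}_{p}\Big[\sum_{e \in t} f_e\Big] = \sum_{e} f_e \, \mu_e,
\qquad
\mu_e = \frac{\partial \log Z}{\partial \theta_e},
```

where $Z = \sum_t \exp\big(\sum_{e \in t} \theta_e\big)$ can be computed with the Matrix-Tree theorem, so first-order expectations follow from differentiating $\log Z$.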
arXiv Detail & Related papers (2020-08-29T14:58:26Z)
- Learning to Stop While Learning to Predict [85.7136203122784]
Many algorithm-inspired deep models are restricted to a fixed depth for all inputs.
Similar to algorithms, the optimal depth of a deep architecture may be different for different input instances.
In this paper, we tackle this varying depth problem using a steerable architecture.
We show that the learned deep model along with the stopping policy improves the performances on a diverse set of tasks.
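A generic sketch of input-dependent depth with a learned stopping score is given below; it is illustrative only (a shared block applied repeatedly, a sigmoid "stop" head thresholded at inference) and does not reproduce the paper's steerable architecture or how its stopping policy is trained.

```python
import torch
import torch.nn as nn

class AdaptiveDepthNet(nn.Module):
    """Illustrative early-exit loop: apply a shared block repeatedly and let a
    small 'stop' head decide, per input, when to halt, so easy inputs exit
    early and hard ones go deeper. Not the paper's specific architecture."""
    def __init__(self, dim: int, max_depth: int = 8):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.stop_head = nn.Linear(dim, 1)   # score for halting at this depth
        self.max_depth = max_depth

    def forward(self, x: torch.Tensor, threshold: float = 0.5):
        depth_used = torch.zeros(x.shape[0], dtype=torch.long, device=x.device)
        done = torch.zeros(x.shape[0], dtype=torch.bool, device=x.device)
        for d in range(self.max_depth):
            # Only inputs that have not halted get another application of the block.
            x = torch.where(done.unsqueeze(1), x, self.block(x))
            halt = torch.sigmoid(self.stop_head(x)).squeeze(1) > threshold
            depth_used[~done] = d + 1
            done = done | halt
            if done.all():
                break
        return x, depth_used

# feats, depths = AdaptiveDepthNet(dim=16)(torch.randn(4, 16))
```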
arXiv Detail & Related papers (2020-06-09T07:22:01Z)
- Measuring the Algorithmic Efficiency of Neural Networks [1.1108287264548806]
We show that the number of floating-point operations required to train a classifier to AlexNet-level performance has decreased by a factor of 44x between 2012 and 2019.
This corresponds to algorithmic efficiency doubling every 16 months over a period of 7 years.
We observe that hardware and algorithmic efficiency gains multiply and can be on a similar scale over meaningful horizons, which suggests that a good model of AI progress should integrate measures from both.
arXiv Detail & Related papers (2020-05-08T22:26:37Z)