More Compute Is What You Need
- URL: http://arxiv.org/abs/2404.19484v2
- Date: Thu, 2 May 2024 01:58:15 GMT
- Title: More Compute Is What You Need
- Authors: Zhen Guo,
- Abstract summary: We propose a new scaling law that suggests model performance depends mostly on the amount of compute spent for transformer-based models.
We predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
- Score: 3.184416958830696
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model pre-training has become increasingly expensive, with most practitioners relying on scaling laws to allocate compute budgets for model size and training tokens, commonly referred to as Compute-Optimal or Chinchilla Optimal. In this paper, we hypothesize a new scaling law that suggests model performance depends mostly on the amount of compute spent for transformer-based models, independent of the specific allocation to model size and dataset size. Using this unified scaling law, we predict that (a) for inference efficiency, training should prioritize smaller model sizes and larger training datasets, and (b) assuming the exhaustion of available web datasets, scaling the model size might be the only way to further improve model performance.
Related papers
- Scaling Inference-Efficient Language Models [3.271571137474847]
We show that model architecture affects inference latency, where models of the same size can have up to 3.5x difference in latency.
We modify the Chinchilla scaling laws to co-optimize the model parameter count, the number of training tokens, and the model architecture.
We release the Morph-1B model, which improves inference latency by 1.8x while maintaining accuracy on downstream tasks compared to open-source models.
arXiv Detail & Related papers (2025-01-30T03:16:44Z) - The interplay between domain specialization and model size: a case study in the legal domain [8.653321928148547]
We investigate the interplay between domain and model size during continual pre-training under compute-constrained scenarios.
Our goal is to identify a compute-efficient training regime for this scenario.
As model size increases, the compute-effectiveness gap between specialized and general models widens.
arXiv Detail & Related papers (2025-01-03T19:28:53Z) - Optimizing Sequential Recommendation Models with Scaling Laws and Approximate Entropy [104.48511402784763]
Performance Law for SR models aims to theoretically investigate and model the relationship between model performance and data quality.
We propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics.
arXiv Detail & Related papers (2024-11-30T10:56:30Z) - A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z) - Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z) - Observational Scaling Laws and the Predictability of Language Model Performance [51.2336010244645]
We propose an observational approach that bypasses model training and instead builds scaling laws from 100 publically available models.
We show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models.
We show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
arXiv Detail & Related papers (2024-05-17T17:49:44Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Navigating Scaling Laws: Compute Optimality in Adaptive Model Training [39.96209967632896]
In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data.
We extend the concept of optimality by allowing for an adaptive' model, i.e. a model that can change its shape during training.
arXiv Detail & Related papers (2023-11-06T16:20:28Z) - Scaling Laws for Acoustic Models [7.906034575114518]
Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships.
We show that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws.
arXiv Detail & Related papers (2021-06-11T18:59:24Z) - Train Large, Then Compress: Rethinking Model Size for Efficient Training
and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z) - Scaling Laws for Neural Language Models [14.472857826717613]
We study scaling laws for language model performance on the cross-entropy loss.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training.
arXiv Detail & Related papers (2020-01-23T03:59:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.