Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
- URL: http://arxiv.org/abs/2405.18392v2
- Date: Wed, 29 May 2024 16:56:26 GMT
- Title: Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
- Authors: Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi
- Abstract summary: We argue that scale and training research has been needlessly complex due to reliance on the cosine schedule.
We investigate the training behavior of a direct alternative - constant learning rate and cooldowns - and find that it scales predictably and reliably similar to cosine.
We show that weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales.
- Score: 62.132347451049455
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scale has become a main ingredient in obtaining strong machine learning models. As a result, understanding a model's scaling properties is key to effectively designing both the right training setup as well as future generations of architectures. In this work, we argue that scale and training research has been needlessly complex due to reliance on the cosine schedule, which prevents training across different lengths for the same model size. We investigate the training behavior of a direct alternative - constant learning rate and cooldowns - and find that it scales predictably and reliably similar to cosine. Additionally, we show that stochastic weight averaging yields improved performance along the training trajectory, without additional training costs, across different scales. Importantly, with these findings we demonstrate that scaling experiments can be performed with significantly reduced compute and GPU hours by utilizing fewer but reusable training runs. Our code is available at https://github.com/epfml/schedules-and-scaling.
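As a rough picture of the schedule the abstract refers to, the sketch below implements a constant learning rate followed by a linear cooldown to zero, plus a uniform average over saved checkpoints as a stand-in for weight averaging along the trajectory. The warmup length, cooldown fraction, decay shape, and plain-Python averaging are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a constant-LR-plus-cooldown schedule.
# Warmup length, cooldown fraction, and the linear decay shape are illustrative
# assumptions; other decay forms (e.g. 1 - sqrt) are also possible.

def lr_at_step(step, total_steps, base_lr=3e-3, warmup_steps=300, cooldown_frac=0.2):
    """Learning rate at a given step: linear warmup, constant plateau, linear cooldown."""
    cooldown_steps = int(total_steps * cooldown_frac)
    cooldown_start = total_steps - cooldown_steps
    if step < warmup_steps:                      # linear warmup
        return base_lr * (step + 1) / warmup_steps
    if step < cooldown_start:                    # constant plateau
        return base_lr
    remaining = total_steps - step               # linear cooldown, reaches 0 at total_steps
    return base_lr * remaining / max(cooldown_steps, 1)


def average_checkpoints(checkpoints):
    """Uniform average of parameter dicts (a stand-in for weight averaging
    along the trajectory); real code would average framework state dicts."""
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}


if __name__ == "__main__":
    total = 10_000
    for s in (0, 299, 5000, 8000, 9999):
        print(f"step {s:>5}: lr = {lr_at_step(s, total):.6f}")
    # Because the plateau phase does not depend on the total length, the same run
    # can be branched into cooldowns of different lengths to probe several
    # training durations without retraining from scratch.
    print(average_checkpoints([{"w": 1.0}, {"w": 3.0}]))
```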
Related papers
- Custom Gradient Estimators are Straight-Through Estimators in Disguise [3.1037083241174197]
Quantization-aware training comes with a fundamental challenge: the derivatives of quantization functions such as rounding are zero almost everywhere.
We prove that when the learning rate is sufficiently small, a large class of weight gradient estimators is equivalent to the straight-through estimator (STE).
We experimentally show that these results hold for both a small convolutional model trained on the MNIST dataset and a ResNet50 model trained on ImageNet.
arXiv Detail & Related papers (2024-05-08T16:07:56Z)
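A minimal, generic sketch of the straight-through estimator discussed in this entry (assuming PyTorch; not code from the paper): the forward pass rounds, while the backward pass treats the operation as the identity.

```python
import torch

class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; treat the op as the identity in the backward
    pass, so gradients flow through even though round() has zero derivative
    almost everywhere."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through: identity gradient

if __name__ == "__main__":
    w = torch.randn(4, requires_grad=True)
    RoundSTE.apply(w).sum().backward()
    print(w.grad)  # all ones: rounding was "transparent" to the gradient
```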
- The Power of Few: Accelerating and Enhancing Data Reweighting with Coreset Selection [18.683805940232485]
We introduce a novel method that employs core subset selection for reweighting.
By focusing on a strategically selected coreset, our approach offers a robust representation.
The re-calibrated weights are then mapped back to and propagated across the entire dataset.
arXiv Detail & Related papers (2024-03-18T18:30:22Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
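For intuition about the kind of solvable setting this entry describes, here is a small sketch that trains only the linear readout of a ReLU random feature model with gradient descent and prints how train and test loss separate on reused data; the dimensions, noise level, and teacher below are arbitrary illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features, n_train, n_test = 20, 200, 100, 2000

# Fixed random features phi(x) = relu(W x); only the linear readout is trained.
W = rng.normal(size=(n_features, d)) / np.sqrt(d)
teacher = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ teacher + 0.1 * rng.normal(size=n)   # noisy linear teacher
    return np.maximum(X @ W.T, 0.0), y           # ReLU random features

Phi_tr, y_tr = make_data(n_train)
Phi_te, y_te = make_data(n_test)

a = np.zeros(n_features)                         # trainable readout weights
lr = 1e-3
for step in range(1, 5001):
    grad = Phi_tr.T @ (Phi_tr @ a - y_tr) / n_train
    a -= lr * grad
    if step % 1000 == 0:
        train = np.mean((Phi_tr @ a - y_tr) ** 2)
        test = np.mean((Phi_te @ a - y_te) ** 2)
        print(f"step {step:>4}  train {train:.3f}  test {test:.3f}")
```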
- Always-Sparse Training by Growing Connections with Guided Stochastic Exploration [46.4179239171213]
We propose an efficient always-sparse training algorithm with excellent scaling to larger and sparser models.
We evaluate our method on CIFAR-10/100 and ImageNet using VGG and ViT models, and compare it against a range of sparsification methods.
arXiv Detail & Related papers (2024-01-12T21:32:04Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
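The heart of emulated fine-tuning is arithmetic on log-probabilities: the large base model's distribution is adjusted by the difference between a small fine-tuned model and its small base. Below is a toy sketch of that combination on NumPy arrays; the scale factor and the random "logits" are placeholders for the three models' per-token outputs, and real usage would query three language models per decoding step.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def eft_next_token_logprobs(base_large, ft_small, base_small, scale=1.0):
    """Emulated fine-tuning (sketch): large-scale pre-training knowledge plus the
    small-scale fine-tuning 'delta', combined in log-probability space."""
    return log_softmax(log_softmax(base_large)
                       + scale * (log_softmax(ft_small) - log_softmax(base_small)))

if __name__ == "__main__":
    vocab = 5
    rng = np.random.default_rng(1)
    base_large = rng.normal(size=vocab)   # stand-ins for per-token logits
    base_small = rng.normal(size=vocab)
    ft_small = rng.normal(size=vocab)
    logp = eft_next_token_logprobs(base_large, ft_small, base_small)
    print(np.exp(logp), np.exp(logp).sum())  # a valid distribution over tokens
```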
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
arXiv Detail & Related papers (2022-03-11T19:05:42Z)
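One concrete instance of a growth operator is depth growth: initialize the next, deeper stage from the trained blocks of the previous stage so its compute is reused. The sketch below does this for a toy PyTorch stack; it is a generic illustration only, and naive duplication does not exactly preserve the shallow model's function the way the paper's loss-preserving operators aim to.

```python
import copy
import torch
import torch.nn as nn

def grow_depth(stack: nn.Sequential) -> nn.Sequential:
    """Depth-growth sketch: build a 2x deeper stack whose blocks are copies of
    the trained shallow stack, so the next stage starts from the previous
    stage's weights instead of from scratch."""
    grown = []
    for block in stack:
        grown.append(copy.deepcopy(block))
        grown.append(copy.deepcopy(block))  # duplicate each block in place
    return nn.Sequential(*grown)

if __name__ == "__main__":
    small = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
    # ... first training stage on `small` would happen here ...
    big = grow_depth(small)
    x = torch.randn(2, 16)
    print(small(x).shape, big(x).shape, len(big))  # deeper model, same interface
```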
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in both training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Dynamic Scale Training for Object Detection [111.33112051962514]
We propose a Dynamic Scale Training paradigm (abbreviated as DST) to mitigate the scale variation challenge in object detection.
Experimental results demonstrate the efficacy of our proposed DST towards scale variation handling.
It does not introduce inference overhead and could serve as a free lunch for general detection configurations.
arXiv Detail & Related papers (2020-04-26T16:48:17Z)
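As a generic picture of multi-scale training in the spirit of the DST entry above, the sketch below resizes each batch to a resolution drawn from a candidate pool before the forward pass (assuming PyTorch). The pool and the uniform draw are illustrative placeholders: DST chooses the scale adaptively from training feedback, and in detection the box targets would be rescaled along with the images.

```python
import random
import torch
import torch.nn.functional as F

SCALES = [480, 576, 640, 704, 800]   # candidate input sizes (illustrative)

def rescale_batch(images: torch.Tensor) -> torch.Tensor:
    """Resize a batch (N, C, H, W) to a per-batch training scale.
    Here the scale is drawn uniformly; DST instead picks it adaptively."""
    size = random.choice(SCALES)
    return F.interpolate(images, size=(size, size), mode="bilinear",
                         align_corners=False)

if __name__ == "__main__":
    batch = torch.randn(4, 3, 640, 640)
    out = rescale_batch(batch)
    print(out.shape)  # e.g. torch.Size([4, 3, 800, 800]), depending on the draw
```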