Scaling Optimal LR Across Token Horizons
- URL: http://arxiv.org/abs/2409.19913v2
- Date: Wed, 2 Oct 2024 17:03:25 GMT
- Title: Scaling Optimal LR Across Token Horizons
- Authors: Johan Bjorck, Alon Benhaim, Vishrav Chaudhary, Furu Wei, Xia Song,
- Abstract summary: We show how optimal learning rate depends on token horizon in LLM training.
We also provide evidence that LLama-1 used too high LR, and estimate the performance hit from this.
- Score: 81.29631219839311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art LLMs are powered by scaling -- scaling model size, dataset size and cluster size. It is economically infeasible to extensively tune hyperparameter for the largest runs. Instead, approximately optimal hyperparameters must be inferred or \textit{transferred} from smaller experiments. Hyperparameter transfer across model sizes has been studied in Yang et al. However, hyperparameter transfer across dataset size -- or token horizon -- has not been studied yet. To remedy this we conduct a large scale empirical study on how optimal learning rate (LR) depends on token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with token horizon -- longer training necessitates smaller LR. Secondly we demonstrate the the optimal LR follows a scaling law, and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via such scaling laws. We also provide a rule-of-thumb for transferring LR across token horizons with zero overhead over current practices. Lastly we provide evidence that LLama-1 used too high LR, and estimate the performance hit from this. We thus argue that hyperparameter transfer across data size is an important and overlooked component of LLM training.
Related papers
- Where Do Large Learning Rates Lead Us? [5.305784285588872]
We show that only a narrow range of initial LRs leads to optimal results after fine-tuning with a small LR or weight averaging.
We show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task.
In contrast, starting training with too small LRs leads to unstable minima and attempts to learn all features simultaneously, resulting in poor generalization.
arXiv Detail & Related papers (2024-10-29T15:14:37Z) - Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but there exists a gap between the practical performance of low-rank adaptations and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
arXiv Detail & Related papers (2024-10-25T17:07:13Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZOZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster results compared to standard ZOZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Optimization Hyper-parameter Laws for Large Language Models [56.322914260197734]
We present Opt-Laws, a framework that captures the relationship between hyper- parameters and training outcomes.
Our validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss.
This approach significantly reduces computational costs while enhancing overall model performance.
arXiv Detail & Related papers (2024-09-07T09:37:19Z) - Scaling Law with Learning Rate Annealing [4.121865876406014]
Cross-entropy loss curves of neural language models adhere to a scaling law with learning rate (LR) annealing over training steps.
Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss at any given step across any learning rate (LRS)
arXiv Detail & Related papers (2024-08-20T17:30:48Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - Temporal Scaling Law for Large Language Models [24.12384260752973]
We propose the novel concept of Temporal Scaling Law, studying how the test loss of an LLM evolves as the training steps scale up.
In contrast to modeling the test loss as a whole in a coarse-grained manner, we break it down and dive into the fine-grained test loss of each token position.
We derive the much more precise temporal scaling law by studying the temporal patterns of the parameters in the dynamic hyperbolic-law.
arXiv Detail & Related papers (2024-04-27T05:49:11Z) - MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks [56.66010634895913]
The learning rate (LR) is one of the most important hyper-learned network parameters in gradient descent (SGD) training networks (DNN)
In this paper, we propose to learn a proper LR schedule for MLR-SNet tasks.
We also make MLR-SNet to query tasks like different noises, architectures, data modalities, sizes from the training ones, and achieve or even better performance.
arXiv Detail & Related papers (2020-07-29T01:18:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.