Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
- URL: http://arxiv.org/abs/2405.15319v1
- Date: Fri, 24 May 2024 08:00:00 GMT
- Title: Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
- Authors: Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu
- Abstract summary: We show that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance.
For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well.
We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$.
- Score: 42.89066583603415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at $\href{https://llm-stacking.github.io/}{https://llm-stacking.github.io/}$.
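For intuition, here is a minimal PyTorch sketch of depthwise stacking, assuming the model trunk is an nn.ModuleList of identical blocks; the layer count and growth factor are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of the depthwise stacking operator G_stack, assuming a
# model trunk represented as an nn.ModuleList of identical transformer
# blocks. Sizes and the growth factor below are illustrative only.
import copy

import torch.nn as nn

def g_stack(small_layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Initialize a deeper trunk by repeating the small model's layer stack."""
    grown = []
    for _ in range(growth_factor):
        # Deep-copy so the repeated blocks can diverge during continued pre-training.
        grown.extend(copy.deepcopy(layer) for layer in small_layers)
    return nn.ModuleList(grown)

# Example: grow a 6-layer trunk into a 24-layer trunk (growth factor 4).
small = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(6)
)
large = g_stack(small, growth_factor=4)
assert len(large) == 24
```

The paper's guideline questions, growth timing and growth factor, correspond here to how long the small model is trained before stacking and to the value of growth_factor.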
Related papers
- Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
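Rusty-DAWG itself builds a suffix-automaton-style index; as a hedged illustration of the question it answers, here is a naive $n$-gram membership check (the function name and example data are ours):

```python
# Naive baseline for n-gram novelty: the fraction of a generation's
# n-grams that never appear verbatim in the training corpus. Rusty-DAWG
# answers the same membership queries with a DAWG index rather than this
# O(corpus) set construction.
def ngram_novelty(generated: list[str], corpus: list[str], n: int = 5) -> float:
    seen = {tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1)}
    grams = [tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)]
    return sum(g not in seen for g in grams) / max(1, len(grams))

# Half of the generation's trigrams are novel with respect to the corpus.
print(ngram_novelty("a b c d e f".split(), "a b c d x y z".split(), n=3))
```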
arXiv Detail & Related papers (2024-06-18T21:31:19Z)
- Never Miss A Beat: An Efficient Recipe for Context Window Extension of Large Language Models with Consistent "Middle" Enhancement [25.20222970947923]
We propose $\textbf{C}$ontinuity-$\textbf{R}$elativity ind$\textbf{E}$xing with g$\textbf{A}$ussian $\textbf{M}$iddle (CREAM), which interpolates positional encodings by manipulating position indices.
Experiments show that CREAM successfully extends LLMs to the target length for both Base and Chat versions of $\texttt{Llama2-7B}$ with "Never Miss A Beat".
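CREAM's Gaussian-middle indexing is more involved than can be shown here; the following sketch illustrates only the underlying mechanism it manipulates, namely remapping position indices so a sequence longer than the trained window reuses the trained positional range.

```python
# Generic positional-index interpolation (not CREAM's exact scheme):
# scale indices so a longer input is squeezed into the trained range.
def interpolated_positions(seq_len: int, trained_len: int) -> list[float]:
    scale = min(1.0, trained_len / seq_len)
    return [i * scale for i in range(seq_len)]

# An 8192-token input reuses positions [0, 4096) of a 4096-trained model.
print(interpolated_positions(8192, 4096)[-1])  # 4095.5
```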
arXiv Detail & Related papers (2024-06-11T10:35:49Z)
- Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models [22.425339110551743]
We introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search.
We show that reusing off-the-shelf small model pairs can significantly improve the length-controlled win rates of both white-box and black-box large models.
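The paper's full search procedure is richer than a single scoring rule; a hedged sketch of the guiding signal, the log-probability gap between a small tuned model and its small untuned base, might look like this (the function name and the beta weight are ours):

```python
# Hedged sketch of weak-to-strong guidance: candidate continuations from
# the large model are re-ranked by the log-probability gap between a small
# tuned model and its untuned base. Notation here is illustrative.
def guided_score(logp_large: float, logp_small_tuned: float,
                 logp_small_base: float, beta: float = 1.0) -> float:
    return logp_large + beta * (logp_small_tuned - logp_small_base)

# At each greedy step, keep the candidate with the highest guided score.
candidates = [(-1.2, -0.9, -1.5), (-1.0, -2.0, -1.1)]
best = max(candidates, key=lambda c: guided_score(*c))
```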
arXiv Detail & Related papers (2024-05-29T16:55:32Z)
- Comparing Plausibility Estimates in Base and Instruction-Tuned Large Language Models [50.15455336684986]
We compare base and instruction-tuned LLM performance on an English sentence plausibility task via explicit prompting and implicit estimation.
Experiment 1 shows that, across model architectures and plausibility datasets, log likelihood ($\textit{LL}$) scores are the most reliable indicator of sentence plausibility.
Experiment 2 shows that $\textit{LL}$ scores across models are modulated by context in the expected way, showing high performance on three metrics of context-sensitive plausibility.
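A minimal sketch of $\textit{LL}$ scoring with a Hugging Face causal LM follows; the model choice is illustrative, not the paper's exact setup.

```python
# Log-likelihood (LL) plausibility scoring with a causal LM. The model's
# loss is mean token cross-entropy, so the total LL (in nats) is -loss
# times the number of predicted tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sentence_ll(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

# The plausible ordering should score higher than the implausible one.
print(sentence_ll("The cat sat on the mat.") > sentence_ll("The mat sat on the cat."))
```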
arXiv Detail & Related papers (2024-03-21T22:08:44Z)
- Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
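The selection rule at the heart of MCTS is compact enough to show directly; below is the standard UCT score, with an exploration constant that is a conventional default rather than a value from this paper.

```python
# Standard UCT selection score used in MCTS: exploit a child's empirical
# mean value while still exploring rarely visited children.
import math

def uct(total_value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    if visits == 0:
        return float("inf")  # always expand unvisited children first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Child with total value 3.0 over 5 visits, parent visited 20 times.
print(uct(3.0, 5, 20))
```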
arXiv Detail & Related papers (2024-03-08T19:16:29Z)
- Learning Thresholds with Latent Values and Censored Feedback [18.129896050051432]
We study a problem where the unknown reward $g(\gamma, v)$ depends on a proposed threshold $\gamma$ and a latent value $v$, and the reward can be achieved $\textit{only}$ if the threshold is lower than or equal to the unknown latent value.
This problem has broad applications in practical scenarios, e.g., reserve price optimization in online auctions, online task assignment in crowdsourcing, and setting recruiting bars in hiring.
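A toy simulation of this feedback model follows; the value distribution and the reward form are illustrative, not the paper's setting.

```python
# Censored feedback: the learner proposes a threshold gamma, a latent value
# v is drawn from an unknown distribution, and the reward g(gamma, v) is
# received only when gamma <= v; otherwise nothing is earned and v stays hidden.
import random

def step(gamma: float, g) -> float:
    v = random.uniform(0.0, 1.0)  # latent value, never observed directly
    return g(gamma, v) if gamma <= v else 0.0

# Reserve-price flavor: revenue equals the reserve gamma when the bid clears it.
rewards = [step(0.3, lambda gamma, v: gamma) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # roughly 0.3 * P(v >= 0.3) = 0.21
```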
arXiv Detail & Related papers (2023-12-07T19:30:08Z)
- Towards Understanding Clean Generalization and Robust Overfitting in Adversarial Training [45.42044569913022]
We study the $\textit{Clean Generalization and Robust Overfitting}$ phenomenon in adversarial training.
We show that a three-stage phase transition occurs during the learning process and that the network converges to a robust memorization regime.
We also empirically verify our theoretical analysis with experiments on real-image recognition.
arXiv Detail & Related papers (2023-06-02T05:07:42Z)
- Deep Learning Meets Projective Clustering [66.726500395069]
A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A \in \mathbb{R}^{n \times d}$ and approximate it by projecting its rows onto a single low-rank subspace.
Inspired by $\textit{projective clustering}$ from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces.
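A minimal numpy sketch of the $k$-subspace idea follows; the values of $k$, the subspace rank, and the iteration count are illustrative, and the alternating procedure is a generic fit, not the paper's exact algorithm.

```python
# k-subspace fit: instead of projecting all rows of A onto one low-rank
# subspace, alternate between assigning each row to its best-fitting
# subspace and refitting each subspace by truncated SVD.
import numpy as np

def k_subspace_fit(A: np.ndarray, k: int = 4, rank: int = 8, iters: int = 10):
    rng = np.random.default_rng(0)
    assign = rng.integers(k, size=A.shape[0])
    for _ in range(iters):
        bases = []
        for j in range(k):
            rows = A[assign == j]
            if len(rows) < rank:  # guard against (near-)empty clusters
                rows = A[rng.integers(A.shape[0], size=rank)]
            _, _, Vt = np.linalg.svd(rows, full_matrices=False)
            bases.append(Vt[:rank])  # top right singular vectors span the subspace
        # Reassign each row to the subspace leaving the smallest residual.
        residuals = np.stack(
            [np.linalg.norm(A - (A @ V.T) @ V, axis=1) for V in bases]
        )
        assign = residuals.argmin(axis=0)
    return assign, bases

A = np.random.default_rng(1).normal(size=(1000, 64))
assign, bases = k_subspace_fit(A)
```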
arXiv Detail & Related papers (2020-10-08T22:47:48Z)
- Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation [30.137884459159107]
We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces.
We develop a simple, iterative learning algorithm that finds an $\epsilon$-Schmidt $Q$-function with sample complexity of $\widetilde{O}\big(\frac{1}{\epsilon^{\max(d_1, d_2)+2}}\big)$ when the optimal $Q$-function has low rank $r$ and the discounting factor $\gamma$ is below a certain threshold.
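The paper's iterative algorithm is more refined than can be shown here; the low-rank structure it exploits can be illustrated by one-shot rank-$r$ truncation of a noisy $Q$ table.

```python
# If the optimal Q(s, a) table has rank r, a noisy estimate can be
# denoised by keeping only the top-r singular directions. This one-shot
# truncation is an illustration, not the paper's algorithm.
import numpy as np

def rank_r_truncate(Q_noisy: np.ndarray, r: int) -> np.ndarray:
    U, S, Vt = np.linalg.svd(Q_noisy, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]

rng = np.random.default_rng(0)
Q_true = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))  # rank 3
Q_noisy = Q_true + 0.1 * rng.normal(size=Q_true.shape)
Q_hat = rank_r_truncate(Q_noisy, r=3)
print(np.linalg.norm(Q_hat - Q_true) < np.linalg.norm(Q_noisy - Q_true))  # True
```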
arXiv Detail & Related papers (2020-06-11T00:55:35Z)