Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
- URL: http://arxiv.org/abs/2405.15319v2
- Date: Tue, 22 Oct 2024 10:31:59 GMT
- Title: Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
- Authors: Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu
- Abstract summary: This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines.
We show that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance.
- Score: 42.89066583603415
- License:
- Abstract: LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://llm-stacking.github.io.
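For intuition, below is a minimal sketch of what a depthwise stacking growth operator in the spirit of $G_{\text{stack}}$ can look like: the trained blocks of a small model are copied and repeated to initialize a deeper model, which is then pre-trained further. It assumes a PyTorch-style decoder-only transformer whose blocks live in an `nn.ModuleList` named `layers`; the module layout and the growth factor of 4 are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of depthwise stacking (G_stack-style growth), assuming a
# PyTorch-style transformer whose decoder blocks live in an nn.ModuleList.
# Module names and the growth factor are illustrative assumptions.
import copy

import torch.nn as nn


def grow_depthwise(small_layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Repeat the trained small layer stack `growth_factor` times,
    e.g. 8 layers grown with factor 4 -> 32 layers, copying weights."""
    grown = []
    for _ in range(growth_factor):
        for layer in small_layers:
            grown.append(copy.deepcopy(layer))  # duplicate parameters, not references
    return nn.ModuleList(grown)


# Hypothetical usage: pre-train the small model for d tokens (growth timing),
# grow it, then continue pre-training the larger model on the remaining budget.
# small_model.layers = grow_depthwise(small_model.layers, growth_factor=4)
```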
Related papers
- Large Language Models Are Overparameterized Text Encoders [17.608805125623803]
Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training.
We show that by pruning the last $p\%$ layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time.
arXiv Detail & Related papers (2024-10-18T16:26:45Z) - FLARE: Faithful Logic-Aided Reasoning and Exploration [50.9814063216852]
We introduce a novel approach for traversing the problem space using task decompositions.
We use Large Language Models to plan a solution and soft-formalise the query into facts and predicates using logic programming code.
Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z) - Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
arXiv Detail & Related papers (2024-06-18T21:31:19Z) - An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding [25.20222970947923]
We propose a method to extend the context length of pre-trained large language models (LLMs).
$\texttt{CREAM}$ interpolates positional encodings by manipulating position indices.
Experiments show that $\texttt{CREAM}$ successfully extends LLMs to the target length for both Base and Chat versions of $\texttt{Llama2-7B}$ with "Never Miss A Beat".
arXiv Detail & Related papers (2024-06-11T10:35:49Z) - Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models [22.425339110551743]
We introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search.
In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training.
In a more difficult instruction-following benchmark, we show that reusing off-the-shelf small models can improve the length-controlled win rates of both white-box and black-box large models.
arXiv Detail & Related papers (2024-05-29T16:55:32Z) - Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
arXiv Detail & Related papers (2024-03-08T19:16:29Z) - Intention Analysis Makes LLMs A Good Jailbreak Defender [79.4014719271075]
In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$).
The principle behind this is to trigger LLMs' inherent ability to self-correct and improve through a two-stage process.
$\mathbb{IA}$ is an inference-only method and thus can enhance the safety of LLMs without compromising their helpfulness.
arXiv Detail & Related papers (2024-01-12T13:15:05Z) - Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression [21.849997443967705]
The two-layer regression problem has been well studied in prior works.
The first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit.
We prove that the Hessian matrix of the loss function is positive definite and Lipschitz continuous under certain assumptions.
arXiv Detail & Related papers (2023-11-26T19:19:02Z) - Towards Understanding Clean Generalization and Robust Overfitting in Adversarial Training [38.44734564565478]
We study the $\textit{Clean Generalization and Robust Overfitting}$ phenomenon in adversarial training.
We show that a three-stage phase transition occurs during the learning process and the network converges to a robust memorization regime.
We also empirically verify our theoretical analysis with experiments on real-image recognition.
arXiv Detail & Related papers (2023-06-02T05:07:42Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)