Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
- URL: http://arxiv.org/abs/2405.15319v2
- Date: Tue, 22 Oct 2024 10:31:59 GMT
- Title: Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training
- Authors: Wenyu Du, Tongxu Luo, Zihan Qiu, Zeyu Huang, Yikang Shen, Reynold Cheng, Yike Guo, Jie Fu
- Abstract summary: This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines.
We show that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance.
- Score: 42.89066583603415
- License:
- Abstract: LLMs are computationally expensive to pre-train due to their large scale. Model growth emerges as a promising approach by leveraging smaller models to accelerate the training of larger ones. However, the viability of these model growth methods in efficient LLM pre-training remains underexplored. This work identifies three critical $\underline{\textit{O}}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines. To tackle $\textit{O}$1, we summarize existing approaches into four atomic growth operators and systematically evaluate them in a standardized LLM pre-training setting. Our findings reveal that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance on eight standard NLP benchmarks compared to strong baselines. Motivated by these promising results, we conduct extensive experiments to delve deeper into $G_{\text{stack}}$ to address $\textit{O}$2 and $\textit{O}$3. For $\textit{O}$2 (untested scalability), our study shows that $G_{\text{stack}}$ is scalable and consistently performs well, with experiments up to 7B LLMs after growth and pre-training LLMs with 750B tokens. For example, compared to a conventionally trained 7B model using 300B tokens, our $G_{\text{stack}}$ model converges to the same loss with 194B tokens, resulting in a 54.6\% speedup. We further address $\textit{O}$3 (lack of empirical guidelines) by formalizing guidelines to determine growth timing and growth factor for $G_{\text{stack}}$, making it practical in general LLM pre-training. We also provide in-depth discussions and comprehensive ablation studies of $G_{\text{stack}}$. Our code and pre-trained model are available at https://llm-stacking.github.io.
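For intuition, below is a minimal sketch of what a depthwise stacking growth operator in the spirit of $G_{\text{stack}}$ can look like: the trained blocks of a small model are copied and repeated to initialize a deeper model, which is then pre-trained further. It assumes a PyTorch-style decoder-only transformer whose blocks live in an `nn.ModuleList` named `layers`; the module layout and the growth factor of 4 are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of depthwise stacking (G_stack-style growth), assuming a
# PyTorch-style transformer whose decoder blocks live in an nn.ModuleList.
# Module names and the growth factor are illustrative assumptions.
import copy

import torch.nn as nn


def grow_depthwise(small_layers: nn.ModuleList, growth_factor: int) -> nn.ModuleList:
    """Repeat the trained small layer stack `growth_factor` times,
    e.g. 8 layers grown with factor 4 -> 32 layers, copying weights."""
    grown = []
    for _ in range(growth_factor):
        for layer in small_layers:
            grown.append(copy.deepcopy(layer))  # duplicate parameters, not references
    return nn.ModuleList(grown)


# Hypothetical usage: pre-train the small model for d tokens (growth timing),
# grow it, then continue pre-training the larger model on the remaining budget.
# small_model.layers = grow_depthwise(small_model.layers, growth_factor=4)
```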
Related papers
- Large Language Models Are Overparameterized Text Encoders [17.608805125623803]
Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training.
We show that by pruning the last $p\%$ layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time.
arXiv Detail & Related papers (2024-10-18T16:26:45Z) - FLARE: Faithful Logic-Aided Reasoning and Exploration [50.9814063216852]
We introduce a novel approach for traversing the problem space using task decompositions.
We use Large Language Models to plan a solution and soft-formalise the query into facts and predicates using logic programming code.
Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z) - Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG [57.14250086701313]
We investigate the extent to which modern LMs generate $n$-grams from their training data.
We develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data.
arXiv Detail & Related papers (2024-06-18T21:31:19Z) - An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding [25.20222970947923]
We propose a method to extend the context length of pre-trained large language models (LLMs).
$\texttt{CREAM}$ interpolates positional encodings by manipulating position indices.
Experiments show that $\texttt{CREAM}$ successfully extends LLMs to the target length for both Base and Chat versions of $\texttt{Llama2-7B}$ with "Never Miss A Beat".
arXiv Detail & Related papers (2024-06-11T10:35:49Z) - Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models [22.425339110551743]
We introduce $\textit{weak-to-strong search}$, framing the alignment of a large language model as a test-time greedy search.
In controlled-sentiment generation and summarization, we use tuned and untuned $\texttt{gpt2}$s to improve the alignment of large models without additional training.
In a more difficult instruction-following benchmark, we show that reusing off-the-shelf small models can improve the length-controlled win rates of both white-box and black-box large models.
arXiv Detail & Related papers (2024-05-29T16:55:32Z) - Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
arXiv Detail & Related papers (2024-03-08T19:16:29Z) - Intention Analysis Makes LLMs A Good Jailbreak Defender [79.4014719271075]
In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$).
The principle behind this is to trigger LLMs' inherent ability to self-correct and improve through a two-stage process.
$\mathbb{IA}$ is an inference-only method and thus can enhance the safety of LLMs without compromising their helpfulness.
arXiv Detail & Related papers (2024-01-12T13:15:05Z) - Local Convergence of Approximate Newton Method for Two Layer Nonlinear Regression [21.849997443967705]
The two-layer regression problem has been well studied in prior works.
The first layer is activated by a ReLU unit, and the second layer is activated by a softmax unit.
We prove that the Hessian matrix of the loss function is positive definite and Lipschitz continuous under certain assumptions.
arXiv Detail & Related papers (2023-11-26T19:19:02Z) - Towards Understanding Clean Generalization and Robust Overfitting in Adversarial Training [38.44734564565478]
We study the $\textit{Clean Generalization and Robust Overfitting}$ phenomenon in adversarial training.
We show that a three-stage phase transition occurs during the learning process and the network converges to a robust memorization regime.
We also empirically verify our theoretical analysis with experiments on real-image recognition.
arXiv Detail & Related papers (2023-06-02T05:07:42Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||A||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)