Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
- URL: http://arxiv.org/abs/2602.05393v1
- Date: Thu, 05 Feb 2026 07:19:34 GMT
- Title: Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
- Authors: Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie
- Abstract summary: We propose a Late-to-Early Training (LET) paradigm that enables Large Language Models to learn later knowledge in earlier steps and earlier layers. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. Our method achieves up to 1.6$\times$ speedup with nearly 5% improvement in downstream task accuracy compared to standard training.
- Score: 24.03797089794804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e., late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6$\times$ speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10$\times$ fewer parameters than the target model.
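The core idea in the abstract (early layers of the model being trained are guided, during early training, by late-layer representations of a pretrained model) can be illustrated with a small sketch. This is not the paper's implementation: the cosine-based alignment term, the linear decay schedule, and every name below (`alignment_loss`, `guidance_weight`, `let_style_loss`, `warm_steps`, `lam0`) are assumptions made for illustration. The sketch also assumes both sets of hidden states have already been projected to the same width, since a pretrained model with 10$\times$ fewer parameters would normally have a smaller hidden size.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def alignment_loss(student_early, teacher_late):
    """Mean (1 - cosine similarity) over token positions.

    student_early: hidden states from an early layer of the model being trained.
    teacher_late:  hidden states from a late layer of the pretrained model.
    Both are lists of per-token vectors of the same width (an assumption).
    """
    per_token = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_early, teacher_late)
    ]
    return sum(per_token) / len(per_token)


def guidance_weight(step, warm_steps, lam0=1.0):
    """Hypothetical schedule: decay linearly from lam0 to 0 over warm_steps,
    so the guidance only acts during early training."""
    return lam0 * max(0.0, 1.0 - step / warm_steps)


def let_style_loss(lm_loss, student_early, teacher_late, step, warm_steps, lam0=1.0):
    """Language-modeling loss plus the decaying alignment term."""
    weight = guidance_weight(step, warm_steps, lam0)
    return lm_loss + weight * alignment_loss(student_early, teacher_late)


if __name__ == "__main__":
    # Toy hidden states for two token positions, width 3.
    student = [[0.2, 0.1, 0.7], [0.5, 0.5, 0.0]]
    teacher = [[0.1, 0.1, 0.8], [0.4, 0.6, 0.1]]
    for step in (0, 500, 1000):
        print(step, let_style_loss(2.0, student, teacher, step, warm_steps=1000))
```

In a real training loop the alignment term would be computed on framework tensors and backpropagated only into the model being trained, with the pretrained model kept frozen. The decay to zero is one plausible way to confine the guidance to early steps; the abstract does not state which schedule or distance the authors use.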
Related papers
- Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empirical Study in Mathematical Reasoning [42.80470927369973]
We study how model scale, data volume, and computational budget interact to shape performance. We find that larger models trained for fewer steps consistently outperform smaller models trained for more steps. In data-constrained regimes, repeated reuse of high-quality data proves highly effective.
arXiv Detail & Related papers (2025-09-29T17:10:35Z)
- Thinking Augmented Pre-training [88.04395622064708]
Thinking augmented Pre-Training is a universal methodology that augments text with automatically generated thinking trajectories. This paper introduces a simple and scalable approach to improve the data efficiency of large language model (LLM) training by augmenting existing text data with thinking trajectories.
arXiv Detail & Related papers (2025-09-24T14:45:13Z)
- Rethinking 1-bit Optimization Leveraging Pre-trained Large Language Models [32.16681909538446]
1-bit LLM quantization offers significant advantages in reducing storage and computational costs. Existing methods typically train 1-bit LLMs from scratch, failing to fully leverage pre-trained models. We introduce a consistent progressive training for both forward and backward, smoothly converting the floating-point weights into the binarized ones.
arXiv Detail & Related papers (2025-08-09T13:00:16Z)
- AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies [36.645912291368546]
We present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model with 8 experts with 16 billion parameters each.
This approach optimizes performance while minimizing data requirements through a two-stage process.
We successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
arXiv Detail & Related papers (2024-08-13T02:07:00Z)
- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length [65.24730341801468]
This paper introduces a novel, simple, and effective method named "GrowLength" to accelerate the pretraining process of Large Language Models.
Our method progressively increases the training length throughout the pretraining phase, thereby mitigating computational costs and enhancing efficiency.
arXiv Detail & Related papers (2023-10-01T05:25:24Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.