Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training
- URL: http://arxiv.org/abs/2510.08008v1
- Date: Thu, 09 Oct 2025 09:45:45 GMT
- Title: Recycling Pretrained Checkpoints: Orthogonal Growth of Mixture-of-Experts for Efficient Large Language Model Pre-Training
- Authors: Ruizhe Wang, Yucheng Ding, Xiao Liu, Yaoxiang Wang, Peng Cheng, Baining Guo, Zhengjun Zha, Yeyun Gong
- Abstract summary: We propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We scale our approach to models with 70B parameters and over 1T training tokens, achieving a 10.66% accuracy gain over training from scratch.
- Score: 70.60554423630803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapidly increasing computational cost of pretraining Large Language Models necessitates more efficient approaches. Substantial compute has already been invested in existing well-trained checkpoints, but many remain underutilized due to engineering constraints or limited model capacity. To efficiently reuse this "sunk" cost, we propose to recycle pretrained checkpoints by expanding their parameter counts and continuing training. We propose an orthogonal growth method well-suited for converged Mixture-of-Experts models: interpositional layer copying for depth growth and expert duplication with injected noise for width growth. To determine the optimal timing for such growth across a sequence of checkpoints, we perform comprehensive scaling experiments, which reveal that final accuracy is strongly and positively correlated with the amount of sunk cost, indicating that greater prior investment leads to better performance. We scale our approach to models with 70B parameters and over 1T training tokens, achieving a 10.66% accuracy gain over training from scratch under the same additional compute budget. Our checkpoint-recycling approach establishes a foundation for economically efficient large language model pretraining.
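The abstract names two growth operations. The sketch below is one plausible, hypothetical rendering of them for a PyTorch-style MoE transformer; the module layout (`nn.ModuleList` of layers and of experts), the placement of the copies, and the noise scale are assumptions, not the authors' reference implementation, which would among other things also need to widen the router to address the new experts.

```python
import copy
import torch
import torch.nn as nn


def grow_depth_interpositional(layers: nn.ModuleList) -> nn.ModuleList:
    """Depth growth: interleave a copy of each layer directly after the
    original, doubling depth while keeping the computed function close
    to the source model's."""
    grown = []
    for layer in layers:
        grown.append(layer)
        grown.append(copy.deepcopy(layer))  # copy sits next to its source
    return nn.ModuleList(grown)


def grow_width_expert_duplication(experts: nn.ModuleList,
                                  noise_std: float = 1e-2) -> nn.ModuleList:
    """Width growth: duplicate every expert and perturb the copy with
    small Gaussian noise so duplicates can diverge during continued
    training. noise_std is an illustrative value, not the paper's."""
    grown = []
    for expert in experts:
        grown.append(expert)
        clone = copy.deepcopy(expert)
        with torch.no_grad():
            for p in clone.parameters():
                p.add_(noise_std * torch.randn_like(p))
        grown.append(clone)
    return nn.ModuleList(grown)
```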
Related papers
- Thinking Augmented Pre-training [88.04395622064708]
Thinking Augmented Pre-Training is a universal methodology that augments existing text data with automatically generated thinking trajectories, giving a simple and scalable way to improve the data efficiency of large language model (LLM) training.
arXiv Detail & Related papers (2025-09-24T14:45:13Z)
- Train Long, Think Short: Curriculum Learning for Efficient Reasoning [51.506559652495476]
We propose a curriculum learning strategy for length-controlled reasoning. Our method starts with generous token budgets and gradually tightens them over training (a minimal schedule sketch follows this entry). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines.
arXiv Detail & Related papers (2025-08-12T13:48:03Z)
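As a concrete reading of the budget schedule described in the entry above, here is a hypothetical linear anneal of the per-response token budget; the paper's actual schedule shape, endpoints, and update rule are not specified in this summary.

```python
def token_budget(step: int, total_steps: int,
                 start_budget: int = 1024, end_budget: int = 256) -> int:
    """Linearly anneal the reasoning token budget from generous to tight.

    start_budget and end_budget are illustrative values, not the paper's.
    """
    frac = min(step / max(total_steps, 1), 1.0)
    return int(start_budget + frac * (end_budget - start_budget))


# Example: budgets at the start, middle, and end of a 10k-step run.
print([token_budget(s, 10_000) for s in (0, 5_000, 10_000)])  # [1024, 640, 256]
```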
- IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining [50.53912352342753]
We propose an integrated enlarge-and-prune pipeline, which combines enlarged-model training, pruning, and recovery. We conduct experiments on compressing 2.8B models to 1.3B with up to 2T tokens in pretraining. The results demonstrate that the integrated approach not only provides insights into the token efficiency of enlarged-model pretraining but also achieves superior performance for the pruned models.
arXiv Detail & Related papers (2025-03-07T20:35:31Z)
- Training Language Models to Reason Efficiently [14.390800014819439]
We use reinforcement learning to train large reasoning models to reason efficiently. Our method incentivizes models to minimize unnecessary computational overhead while maintaining accuracy. Experiments on two open-weight large reasoning models demonstrate significant reductions in inference cost while preserving most of the accuracy.
arXiv Detail & Related papers (2025-02-06T19:18:16Z)
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training (a hedged sketch follows this entry).
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
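As a hedged reading of the modified law in the entry above: taking the standard Chinchilla form and substituting the average parameter count gives the following; the paper's exact functional form and fitted constants may differ.

```latex
% Chinchilla-style loss with the parameter count N replaced by the average
% parameter count \bar{N} over pre-training; E, A, B, \alpha, \beta are
% fitted constants and D is the number of training tokens.
L(\bar{N}, D) = E + \frac{A}{\bar{N}^{\alpha}} + \frac{B}{D^{\beta}},
\qquad
\bar{N} = \frac{1}{T} \int_{0}^{T} N(t)\, \mathrm{d}t
```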
- Online Convolutional Re-parameterization [51.97831675242173]
We present Online Convolutional Re-parameterization (OREPA), a two-stage pipeline aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution (the basic conv-BN folding identity is sketched after this entry).
Compared with state-of-the-art re-parameterization models, OREPA reduces the training-time memory cost by about 70% and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
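For background on the OREPA entry above: re-parameterization methods build on the identity that a convolution followed by a frozen BatchNorm collapses into a single convolution. The PyTorch sketch below shows only this classic conv-BN folding, not OREPA's online training-time pipeline.

```python
import torch
import torch.nn as nn


def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an eval-mode BatchNorm into the preceding convolution so the
    pair becomes a single conv with identical outputs."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta, applied to y = Wx + b.
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data.copy_(bn.bias + scale * (conv_bias - bn.running_mean))
    return fused
```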
- Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum-learning-based approach, which helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z)
- On the Transformer Growth for Progressive BERT Training [37.57617077192438]
We find that similar to network architecture search, Transformer growth also favors compound scaling.
In light of our analyses, the proposed method speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively.
arXiv Detail & Related papers (2020-10-23T17:44:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.