Layer-adaptive Expert Pruning for Pre-Training of Mixture-of-Experts Large Language Models
- URL: http://arxiv.org/abs/2601.14327v1
- Date: Tue, 20 Jan 2026 08:39:04 GMT
- Title: Layer-adaptive Expert Pruning for Pre-Training of Mixture-of-Experts Large Language Models
- Authors: YuanLab.ai, Shawn Wu, Jiangang Luo, Tong Yu, Darcy Chen, Sean Wang, Xudong Zhao, Louie Li, Claire Wang, Hunter He, Carol Wang, Allen Wang
- Abstract summary: This work introduces a Layer-Adaptive Expert Pruning (LAEP) algorithm for the pre-training stage of Mixture-of-Experts (MoE) Large Language Models (LLMs). Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. In particular, when pre-training the 1010B Base model from scratch, LAEP achieves a 48.3% improvement in training efficiency alongside a 33.3% parameter reduction, while still delivering excellent performance across multiple domains.
- Score: 13.25114757425841
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although Mixture-of-Experts (MoE) Large Language Models (LLMs) deliver superior accuracy with a reduced number of active parameters, their pre-training represents a significant computational bottleneck due to underutilized experts and limited training efficiency. This work introduces a Layer-Adaptive Expert Pruning (LAEP) algorithm designed for the pre-training stage of MoE LLMs. In contrast to previous expert pruning approaches that operate primarily in the post-training phase, the proposed algorithm enhances training efficiency by selectively pruning underutilized experts and reorganizing experts across computing devices according to token distribution statistics. Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. In particular, when pre-training the 1010B Base model from scratch, LAEP achieves a 48.3% improvement in training efficiency alongside a 33.3% parameter reduction, while still delivering excellent performance across multiple domains.
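The abstract names the signals LAEP uses (expert utilization and token distribution statistics) but not the exact pruning rule. Below is a minimal, hedged sketch of one way a layer-adaptive rule could work: keep each layer's most-loaded experts until they cover a target share of routed tokens, so layers with skewed routing prune more experts than balanced ones. The function name and the `coverage` knob are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def layer_adaptive_keep_mask(token_counts: np.ndarray,
                             coverage: float = 0.95,
                             min_keep: int = 1) -> np.ndarray:
    """Choose which experts to keep in each MoE layer.

    token_counts: (num_layers, num_experts) array of how many tokens the
        router dispatched to each expert over a profiling window.
    coverage: per-layer share of routed tokens the kept experts must cover
        (an illustrative knob, not a value from the paper).
    """
    num_layers, num_experts = token_counts.shape
    keep = np.zeros((num_layers, num_experts), dtype=bool)
    for layer in range(num_layers):
        share = token_counts[layer] / max(token_counts[layer].sum(), 1)
        order = np.argsort(share)[::-1]          # most-loaded experts first
        cum_share = np.cumsum(share[order])
        n_keep = max(min_keep, int(np.searchsorted(cum_share, coverage)) + 1)
        keep[layer, order[:n_keep]] = True
    return keep
```

The surviving experts would then be repacked across devices to balance per-device token load, matching the abstract's "reorganizing experts across computing devices according to token distribution statistics".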
Related papers
- Predictive Scaling Laws for Efficient GRPO Training of Large Reasoning Models [0.41942958779358663]
We propose a predictive framework that models training dynamics and helps optimize resource usage. We derive an empirical scaling law based on model size, initial performance, and training progress. We find that training beyond a certain number of epochs offers little gain, suggesting earlier stopping can significantly reduce compute without sacrificing performance (an illustrative form is sketched after this entry).
arXiv Detail & Related papers (2025-07-24T01:09:25Z)
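The blurb above names the law's inputs (model size, initial performance, training progress) but not its functional form. Purely as an illustration of how such a law supports early stopping, here is one plausible saturating form; R(t), R_0, R_inf, k, and epsilon are assumptions, not the paper's fitted quantities.

```latex
% Illustrative only -- not the paper's fitted law.
% R(t): performance after t epochs; R_0: initial performance;
% R_inf(N): fitted ceiling for a model with N parameters; k: fitted rate.
R(t) \approx R_\infty(N) - \bigl(R_\infty(N) - R_0\bigr)\, e^{-k t},
\qquad
\text{stop at the first } t \text{ where }
\frac{dR}{dt} = k\bigl(R_\infty(N) - R_0\bigr)e^{-k t} < \epsilon .
```

- Advantageous Parameter Expansion Training Makes Better Large Language Models [50.82647159657912]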
A subset of parameters, termed advantageous parameters, plays a crucial role in determining model performance. We propose Advantageous Parameter EXpansion Training (APEX), a method that progressively expands advantageous parameters into the space of disadvantageous ones. APEX achieves the same perplexity level as conventional training with just 33% of the training data, and yields significant improvements on downstream tasks.
arXiv Detail & Related papers (2025-05-30T06:06:23Z)
- AmorLIP: Efficient Language-Image Pretraining via Amortization [52.533088120633785]
Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. We propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks.
arXiv Detail & Related papers (2025-05-25T05:30:37Z)
- The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws [51.608402959163925]
We present the first systematic exploration of optimal sparse pre-training configurations for large language models. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. We propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training (sketched after this entry).
arXiv Detail & Related papers (2025-01-21T20:23:22Z)
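The modification stated in the blurb drops directly into the standard Chinchilla loss form. Only the substitution of the average parameter count comes from the abstract; the constants and the exact averaging scheme below are assumptions.

```latex
% Chinchilla-style loss with the fixed parameter count N replaced by its
% average over the run (pruning makes N_t vary with training step t):
\bar{N} = \frac{1}{T}\sum_{t=1}^{T} N_t,
\qquad
L(\bar{N}, D) = E + \frac{A}{\bar{N}^{\alpha}} + \frac{B}{D^{\beta}}
% D: training tokens; E, A, B, \alpha, \beta: fitted constants.
```

- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]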
This work focuses on the pre-training loss as a more computation-efficient metric for performance estimation. We present FLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training.
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- Forecast-PEFT: Parameter-Efficient Fine-Tuning for Pre-trained Motion Forecasting Models [68.23649978697027]
Forecast-PEFT is a fine-tuning strategy that freezes the majority of the model's parameters, focusing adjustments on newly introduced prompts and adapters.
Our experiments show that Forecast-PEFT outperforms traditional full fine-tuning methods in motion prediction tasks.
Forecast-FT further improves prediction performance, showing up to a 9.6% improvement over conventional baseline methods; a minimal freeze-and-adapt sketch follows this entry.
arXiv Detail & Related papers (2024-07-28T19:18:59Z)
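Freezing a pre-trained backbone while training only newly added prompts and adapters is the generic PEFT pattern this blurb describes. The sketch below assumes a hypothetical naming convention for those modules and is not Forecast-PEFT's actual API.

```python
import torch
from torch import nn

def freeze_for_peft(model: nn.Module, trainable_keys=("prompt", "adapter")):
    """Freeze every parameter except those belonging to prompt/adapter
    modules (identified here by a hypothetical name convention)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keys)
    trainable = [p for p in model.parameters() if p.requires_grad]
    # Only the small set of unfrozen parameters reaches the optimizer.
    return torch.optim.AdamW(trainable, lr=1e-4)
```

- Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]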
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
- Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training [8.411385346896413]
We focus on improving the practical efficiency of the state-of-the-art EfficientNet models on a new class of accelerator, the Graphcore IPU.
We do this by extending this family of models in the following ways: (i) generalising depthwise convolutions to group convolutions (a minimal sketch follows this entry); (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; and (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution.
arXiv Detail & Related papers (2021-06-07T14:10:52Z)
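Point (i), generalising depthwise to group convolutions, is a one-argument change in most frameworks. A minimal PyTorch illustration, with arbitrary example channel counts:

```python
import torch
from torch import nn

channels = 64
# Depthwise convolution: one group per channel (groups == channels).
depthwise = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=channels)
# Group convolution: 16 channels per group, giving more cross-channel
# mixing than depthwise while staying cheaper than a dense convolution.
grouped = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                    groups=channels // 16)

x = torch.randn(1, channels, 32, 32)
assert grouped(depthwise(x)).shape == x.shape
```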