Related papers: MoE-DisCo:Low Economy Cost Training Mixture-of-Experts Models

URL: http://arxiv.org/abs/2601.06857v1
Date: Sun, 11 Jan 2026 10:59:15 GMT
Title: MoE-DisCo:Low Economy Cost Training Mixture-of-Experts Models
Authors: Xin Ye, Daning Cheng, Boyang Zhang, Yunquan Zhang
Abstract summary: Training large-scale Mixture-of-Experts (MoE) models requires high-memory, high-bandwidth GPUs (e.g., A100). MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets using unsupervised clustering.
Score: 6.372179935695467
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training large-scale Mixture-of-Experts (MoE) models typically requires high-memory, high-bandwidth GPUs (e.g., A100), and their high cost has become a major barrier to large-model training. In contrast, affordable hardware is low-cost but constrained by memory capacity and bandwidth, making it unsuitable for direct LLM training. To address this, we propose MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework. MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets using unsupervised clustering. Each submodel is trained independently and in parallel on its assigned data subset using low-cost devices, without any inter-device communication. Subsequently, all experts are integrated into a complete MoE model and fine-tuned globally for a short period on high-memory, high-bandwidth GPUs. Experiments show that our method matches or even surpasses full-parameter training in performance across multiple downstream tasks, loss function, and perplexity (PPL), while reducing training cost by 47.6 percent to 69.5 percent on Qwen1.5-MoE-2.7B and Llama-MoE-3.5B across different datasets.
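
The staged flow described in the abstract (cluster the data, train one expert per subset with no communication, then assemble the experts into one MoE model) can be illustrated with a toy sketch. This is not the authors' implementation: the k-means routine, the `train_expert` placeholder (which just averages its subset), the `assemble_moe` helper, and the sample feature vectors are all hypothetical stand-ins, and the final global fine-tuning stage is omitted.

```python
def dist2(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = mean(members)
    return assign


def train_expert(subset):
    """Hypothetical placeholder for training one (backbone + expert) submodel.

    Each call uses only its own subset, so calls are independent and could
    run on separate low-cost devices without communication.
    """
    return mean(subset)


def assemble_moe(backbone, experts):
    """Hypothetical assembly step: one shared backbone plus all experts."""
    return {"backbone": backbone, "experts": experts}


# Assumed toy features standing in for embedded training samples.
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [10.0, 10.1], [10.2, 9.9], [9.9, 10.0]]
num_experts = 2

assign = kmeans(features, num_experts)
subsets = [[x for x, a in zip(features, assign) if a == e]
           for e in range(num_experts)]
experts = [train_expert(s) for s in subsets]
moe = assemble_moe(backbone={"name": "shared-backbone"}, experts=experts)

print(assign)  # prints [0, 0, 0, 1, 1, 1]
```

In this sketch the clustering fixes which samples each expert sees before any training starts, which is what removes the need for inter-device communication during the first stage.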

Related papers

Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs [80.72350166388601]
Nemotron Elastic is a framework for building reasoning-oriented LLMs. It embeds nested submodels within a single parent model. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment.
arXiv Detail & Related papers (2025-11-20T18:59:21Z)
PC-MoE: Memory-Efficient and Privacy-Preserving Collaborative Training for Mixture-of-Experts LLMs [56.04036826558497]
We introduce Privacy-preserving Collaborative Mixture-of-Experts (PC-MoE). By design, PC-MoE synergistically combines the strengths of distributed computation with strong confidentiality assurances. It almost matches (and sometimes exceeds) the performance and convergence rate of a fully centralized model, enjoys near 70% peak GPU RAM reduction, while being fully robust against reconstruction attacks.
arXiv Detail & Related papers (2025-06-03T15:00:18Z)
Exploiting Block Coordinate Descent for Cost-Effective LLM Model Training [10.794896407061076]
We propose a pre-training and fine-tuning framework based on block coordinate descent (BCD). Under identical hardware configurations, we reduce the training cost of a 7B model to 33% on A100/A800 clusters.
arXiv Detail & Related papers (2025-05-23T03:05:54Z)
AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation. Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads. We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z)
2 OLMo 2 Furious [154.15728448754854]
We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales. We describe our modified model architecture and training recipe.
arXiv Detail & Related papers (2024-12-31T21:55:10Z)
No Need to Talk: Asynchronous Mixture of Language Models [25.3581396758015]
Smalltalk LM is an innovative method for training a mixture of language models in an almost asynchronous manner. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. Experiments on language modeling demonstrate that SMALLTALK LM achieves significantly lower perplexity than dense model baselines.
arXiv Detail & Related papers (2024-10-04T15:50:10Z)
Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance. We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs). We find that MoEs with a few (4/8) experts are the most serving-efficient solution under the same performance, but cost 2.5-3.5x more in training. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training [13.346719319555943]
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model. Current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. We present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism.
arXiv Detail & Related papers (2023-03-11T05:38:15Z)
Deep Model Assembling [31.88606253639418]
This paper studies a divide-and-conquer strategy to train large models. It divides a large model into smaller modules, training them independently, and reassembling the trained modules to obtain the target model. We introduce a global, shared meta model to implicitly link all the modules together. This enables us to train highly compatible modules that collaborate effectively when they are assembled together.
arXiv Detail & Related papers (2022-12-08T08:04:06Z)
MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [32.278096820269816]
We present a novel MoESys that boosts efficiency in both large-scale training and inference. Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage. For scalable inference in a single node, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
arXiv Detail & Related papers (2022-05-20T09:09:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.