FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
- URL: http://arxiv.org/abs/2505.20225v1
- Date: Mon, 26 May 2025 17:06:25 GMT
- Title: FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
- Authors: Hao Kang, Zichun Yu, Chenyan Xiong
- Abstract summary: We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models. FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs.
- Score: 19.984973014373118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
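As a concrete illustration of the routing scheme named in the abstract, the sketch below implements a single MoE layer with 64 routed experts, top-8 gating, and 2 shared experts, plus a co-activation count in the spirit of analysis (ii). This is a minimal PyTorch sketch, not the FLAME-MoE implementation; the hidden sizes, module names, and the `coactivation_matrix` helper are illustrative assumptions.

```python
# Minimal sketch (assumed, not the FLAME-MoE codebase) of a decoder MoE layer
# with 64 routed experts, top-8 gating, and 2 always-on shared experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """One MoE block: 64 routed experts, top-8 gating, 2 shared experts."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=64, top_k=8, n_shared=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)

        def ffn():
            return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))  # routed
        self.shared = nn.ModuleList(ffn() for _ in range(n_shared))    # always active

    def forward(self, x):                        # x: (tokens, d_model)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the 8 selected experts
        out = sum(e(x) for e in self.shared)     # shared experts see every token
        for slot in range(self.top_k):           # dispatch each of the 8 routing slots
            for e_id, expert in enumerate(self.experts):
                mask = idx[:, slot] == e_id      # tokens whose slot-th choice is expert e_id
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out, idx                          # idx: (tokens, top_k) routing decisions


def coactivation_matrix(idx, n_experts=64):
    """Count how often pairs of experts are selected for the same token."""
    one_hot = F.one_hot(idx, n_experts).sum(dim=1).float()   # (tokens, n_experts), 0/1
    return one_hot.T @ one_hot                                # (n_experts, n_experts)


if __name__ == "__main__":
    layer = MoELayer()
    x = torch.randn(16, 512)                     # 16 dummy token embeddings
    y, idx = layer(x)
    print(y.shape, coactivation_matrix(idx).shape)
```

In this kind of design the shared experts process every token unconditionally while the router spreads the remaining capacity over the routed experts; logging `idx` across training steps is one way to reproduce specialization and co-activation analyses against the released checkpoints.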
Related papers
- Evaluating the Use of LLMs for Documentation to Code Traceability [3.076436880934678]
Large Language Models can establish trace links between various software documentation and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets.
arXiv Detail & Related papers (2025-06-19T16:18:53Z)
- Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer [5.585222292493927]
We propose Union-of-Experts (UoE), which decomposes the transformer into an equivalent group of experts and then implements selective routing over input data and experts. Experiments demonstrate that the UoE model surpasses Full Attention, state-of-the-art MoEs, and efficient transformers.
arXiv Detail & Related papers (2025-03-04T11:01:25Z)
- CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom [45.382739152668954]
Distilling advanced Large Language Models' instruction-following capabilities into smaller models has become a mainstream approach in model training. We investigate more diverse signals to capture comprehensive instruction-response pair characteristics. We propose CrowdSelect, an integrated metric incorporating a clustering-based approach to maintain response diversity.
arXiv Detail & Related papers (2025-03-03T18:56:44Z)
- DeepSeek-V3 Technical Report [147.16121855209246]
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
arXiv Detail & Related papers (2024-12-27T04:03:16Z)
- LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models [7.164238322896674]
LibMoE is a comprehensive framework to streamline the research, training, and evaluation of MoE algorithms.
LibMoE makes MoE in large language models (LLMs) more accessible to a wide range of researchers by standardizing the training and evaluation pipelines.
arXiv Detail & Related papers (2024-11-01T14:04:36Z)
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts [95.26323548734692]
MoMa is a modality-aware mixture-of-experts architecture for pre-training mixed-modal, early-fusion language models.
Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings.
arXiv Detail & Related papers (2024-07-31T17:46:51Z)
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models. We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
- FedLLM-Bench: Realistic Benchmarks for Federated Learning of Large Language Models [48.484485609995986]
Federated learning has enabled multiple parties to collaboratively train large language models without directly sharing their data (FedLLM).
However, there are currently no realistic datasets and benchmarks for FedLLM.
We propose FedLLM-Bench, which involves 8 training methods, 4 training datasets, and 6 evaluation metrics.
arXiv Detail & Related papers (2024-06-07T11:19:30Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.