Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe
- URL: http://arxiv.org/abs/2512.01252v1
- Date: Mon, 01 Dec 2025 03:52:31 GMT
- Title: Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe
- Authors: Yahui Liu, Yang Yue, Jingyuan Zhang, Chenxi Sun, Yang Zhou, Wencong Zeng, Ruiming Tang, Guorui Zhou
- Abstract summary: Recent efforts on Diffusion MoE models have primarily focused on developing more sophisticated routing mechanisms. Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models. We present novel architectures that can be efficiently applied to both latent and pixel-space diffusion frameworks.
- Score: 51.26601054313749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent efforts on Diffusion Mixture-of-Experts (MoE) models have primarily focused on developing more sophisticated routing mechanisms. However, we observe that the underlying architectural configuration space remains markedly under-explored. Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models, including DeepSeek-style expert modules, alternative intermediate widths, varying expert counts, and enhanced attention positional encodings. Our systematic study reveals that carefully tuning these configurations is essential for unlocking the full potential of Diffusion MoE models, often yielding gains that exceed those achieved by routing innovations alone. Through extensive experiments, we present novel architectures that can be applied efficiently to both latent- and pixel-space diffusion frameworks, yielding a practical and efficient training recipe that enables Diffusion MoE models to surpass strong baselines while using equal or fewer activated parameters. All code and models are publicly available at: https://github.com/yhlleo/EfficientMoE.
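To make the architectural axes above concrete, the following is a minimal sketch of a Diffusion MoE feed-forward block in the DeepSeek-style layout the abstract refers to: an always-active shared expert plus fine-grained routed experts selected per token by a top-k router. The class names, default sizes, and routing loop are illustrative assumptions and are not taken from the released EfficientMoE code.

```python
# Minimal sketch of a Diffusion MoE feed-forward block with a DeepSeek-style
# layout (shared expert seen by every token + fine-grained routed experts).
# Names and defaults are illustrative assumptions, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """One GLU-style expert; `inter_dim` sets the intermediate width."""

    def __init__(self, dim: int, inter_dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, inter_dim, bias=False)
        self.up = nn.Linear(dim, inter_dim, bias=False)
        self.down = nn.Linear(inter_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class DiffusionMoEBlock(nn.Module):
    """Shared expert(s) applied to every token plus top-k routed experts."""

    def __init__(self, dim: int = 768, inter_dim: int = 1024,
                 n_routed: int = 8, n_shared: int = 1, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_routed, bias=False)
        self.routed = nn.ModuleList([SwiGLUExpert(dim, inter_dim) for _ in range(n_routed)])
        self.shared = nn.ModuleList([SwiGLUExpert(dim, inter_dim) for _ in range(n_shared)])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) tokens from a latent- or pixel-space diffusion backbone.
        scores = self.router(x).softmax(dim=-1)                 # (B, T, n_routed)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over chosen experts

        out = sum(expert(x) for expert in self.shared)          # shared path sees every token
        # Dense masking loop, kept simple for clarity; a real implementation
        # would gather the tokens assigned to each expert instead.
        for k in range(self.top_k):
            for e_id, expert in enumerate(self.routed):
                mask = (idx[..., k] == e_id).unsqueeze(-1).float()
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out
```

Varying `inter_dim`, `n_routed`, and `top_k` corresponds to the "alternative intermediate widths" and "varying expert counts" axes; the enhanced positional encodings mentioned in the abstract live in the attention layers and are orthogonal to this block.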
Related papers
- ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns [68.61814799047956]
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. We introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations.
arXiv Detail & Related papers (2026-02-17T11:50:58Z)
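As a rough picture of the neuron-partitioning idea summarized above, the sketch below measures how often each hidden unit of a dense GLU layer fires on a calibration set, keeps the most uniformly active units as a shared expert, and clusters the remaining units into routed experts. It is a hedged illustration of the general mechanism; the function name, thresholds, and clustering choice are assumptions, not ExpertWeaver's actual procedure.

```python
# Illustrative sketch of partitioning a dense GLU layer's hidden units into
# experts by activation pattern (training-free upcycling). Not ExpertWeaver's
# code; `partition_glu_neurons`, `shared_frac`, and the k-means step are assumptions.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


@torch.no_grad()
def partition_glu_neurons(gate_proj: torch.nn.Linear, calib_tokens: torch.Tensor,
                          n_experts: int = 8, shared_frac: float = 0.125):
    """Return (shared_ids, per-expert id lists) over the GLU hidden dimension."""
    # Activation pattern: how often each hidden unit fires on calibration tokens
    # (tensors assumed to live on the CPU for simplicity).
    acts = F.silu(gate_proj(calib_tokens))              # (num_tokens, hidden)
    fire_rate = (acts > 0).float().mean(dim=0)          # (hidden,)

    # Units that fire on nearly every token form the shared expert.
    n_shared = int(shared_frac * fire_rate.numel())
    order = fire_rate.argsort(descending=True)
    shared_ids, routed_ids = order[:n_shared], order[n_shared:]

    # Cluster the remaining units by their binary activation profiles.
    profiles = (acts[:, routed_ids] > 0).T.float().numpy()   # (routed_units, num_tokens)
    labels = torch.as_tensor(KMeans(n_clusters=n_experts, n_init=10).fit_predict(profiles))
    experts = [routed_ids[labels == e] for e in range(n_experts)]
    return shared_ids, experts
```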
- Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization [60.309915093470416]
Matryoshka MoE (M-MoE) is a training framework that instills a coarse-to-fine structure directly into the expert ensemble. Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
arXiv Detail & Related papers (2025-09-30T16:56:44Z)
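The coarse-to-fine idea can be pictured as training a single router and expert set so that any smaller expert budget still forms a usable model. The toy routine below samples the number of active experts per training step and reuses the same weights with a fixed budget at inference; it is an illustrative sketch under that assumption, not the M-MoE training code.

```python
# Toy sketch of elastic top-k routing: k is sampled during training so that
# smaller expert budgets remain usable at inference. Illustrative only.
import random
import torch


def elastic_moe_forward(x, router, experts, k_choices=(1, 2, 4), k_eval=None):
    """Mix expert outputs with top-k routing; k is sampled unless fixed via k_eval."""
    k = k_eval if k_eval is not None else random.choice(k_choices)
    scores = router(x).softmax(dim=-1)                  # (B, T, n_experts)
    weights, idx = scores.topk(k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)

    out = torch.zeros_like(x)
    for slot in range(k):
        for e_id, expert in enumerate(experts):
            mask = (idx[..., slot] == e_id).unsqueeze(-1).float()
            out = out + mask * weights[..., slot:slot + 1] * expert(x)
    return out
```

At deployment, the same layer can then be called with `k_eval=1` for a cheap pass or `k_eval=4` for a higher-quality one.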
- Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms [55.1784306456972]
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. We use an internal metric to investigate the mechanisms of MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
arXiv Detail & Related papers (2025-09-28T15:13:38Z)
- Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts [18.18231276284727]
Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely. Recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models.
arXiv Detail & Related papers (2025-09-23T02:07:14Z)
- MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models [52.876185634349575]
We propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into Large Vision-Language Models (LVLMs). For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts. Our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLM-based multi-modal models.
arXiv Detail & Related papers (2025-08-13T13:00:05Z)
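The routing rule described above, intra-modality experts per token plus a shared pool of inter-modality experts, can be sketched as follows. The expert counts, the top-1 choice, and the `make_expert` factory (any callable returning an FFN module, e.g. a small two-layer MLP) are illustrative assumptions, not the MoIIE implementation.

```python
# Rough sketch of modality-guided routing: each token uses its own modality's
# expert group plus a shared inter-modality pool. Hypothetical structure only.
import torch
import torch.nn as nn


class ModalityRoutedFFN(nn.Module):
    def __init__(self, dim, make_expert, n_per_modality=4, n_shared=2):
        super().__init__()
        self.text_experts = nn.ModuleList([make_expert() for _ in range(n_per_modality)])
        self.image_experts = nn.ModuleList([make_expert() for _ in range(n_per_modality)])
        self.shared_experts = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.text_router = nn.Linear(dim, n_per_modality, bias=False)
        self.image_router = nn.Linear(dim, n_per_modality, bias=False)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); is_image: (B, T) bool mask separating image from text tokens.
        out = sum(e(x) for e in self.shared_experts)     # inter-modality pool sees all tokens
        for experts, router, modality_mask in (
            (self.text_experts, self.text_router, ~is_image),
            (self.image_experts, self.image_router, is_image),
        ):
            scores = router(x).softmax(dim=-1)
            weight, idx = scores.topk(1, dim=-1)         # top-1 intra-modality expert per token
            for e_id, expert in enumerate(experts):
                hit = ((idx[..., 0] == e_id) & modality_mask).unsqueeze(-1).float()
                out = out + hit * weight * expert(x)
        return out
```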
- Scaling Laws for Native Multimodal Models [53.490942903659565]
We revisit the architectural design of native multimodal models and conduct an extensive scaling laws study. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones. We show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
arXiv Detail & Related papers (2025-04-10T17:57:28Z)
- Multi-Architecture Multi-Expert Diffusion Models [18.463425624382115]
We introduce Multi-architecturE Multi-Expert diffusion models (MEME).
MEME operates 3.3 times faster than baselines while improving image generation quality (FID scores) by 0.62 (FFHQ) and 0.37 (CelebA).
We argue that MEME opens a new design choice for diffusion models that can be easily applied in other scenarios, such as large multi-expert models.
arXiv Detail & Related papers (2023-06-08T07:24:08Z)