The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts
- URL: http://arxiv.org/abs/2505.06839v1
- Date: Sun, 11 May 2025 04:35:40 GMT
- Title: The power of fine-grained experts: Granularity boosts expressivity in Mixture of Experts
- Authors: Enric Boix-Adsera, Philippe Rigollet
- Abstract summary: This paper investigates the impact of the number of active experts, termed granularity, on frontier model architectures. We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity.
- Score: 6.892193480589255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) layers are increasingly central to frontier model architectures. By selectively activating parameters, they reduce computational cost while scaling total parameter count. This paper investigates the impact of the number of active experts, termed granularity, comparing architectures with many (e.g., 8 per layer in DeepSeek) to those with fewer (e.g., 1 per layer in Llama-4 models). We prove an exponential separation in network expressivity based on this design parameter, suggesting that models benefit from higher granularity. Experimental results corroborate our theoretical findings and illustrate this separation.
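The abstract's notion of granularity, the number of experts activated per token, can be made concrete with a minimal sketch. The NumPy example below is illustrative only and is not the paper's implementation; all names, shapes, and the use of random linear maps as stand-in experts are assumptions. It shows a top-k routed MoE forward pass in which compute scales with the number of active experts k rather than with the total expert count.

```python
import numpy as np

def moe_layer(x, W_gate, experts, k):
    """Sparse MoE forward pass for one token vector x.

    W_gate: (d, n_experts) router weights; experts: list of callables;
    k: number of active experts per token (the "granularity").
    """
    logits = x @ W_gate                       # router score for each expert
    top_k = np.argsort(logits)[-k:]           # indices of the k highest-scoring experts
    gates = np.exp(logits[top_k] - logits[top_k].max())
    gates = gates / gates.sum()               # softmax over the selected experts only
    # Only k experts run, so compute scales with k, not with len(experts).
    return sum(g * experts[i](x) for g, i in zip(gates, top_k))

rng = np.random.default_rng(0)
d, n_experts = 4, 8
x = rng.normal(size=d)
W_gate = rng.normal(size=(d, n_experts))
# Each "expert" is just a random linear map here, standing in for an FFN block.
weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda v, W=W: v @ W for W in weights]

out_fine = moe_layer(x, W_gate, experts, k=8)    # higher granularity
out_coarse = moe_layer(x, W_gate, experts, k=1)  # lower granularity
```

Running with k=8 versus k=1 loosely mirrors the many-active-expert versus single-active-expert configurations compared in the abstract: both passes produce an output of the same shape, but the higher-granularity pass mixes more expert functions per token.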
Related papers
- On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating [75.29576838162714]
DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and a normalized sigmoid gating mechanism. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency from both the shared expert strategy and the normalized sigmoid gating.
arXiv Detail & Related papers (2025-05-16T04:58:18Z)
- Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models [24.64757529640278]
Cluster-driven Expert Pruning (C-Prune) is a novel two-stage framework for adaptive task-specific compression of large language models. C-Prune operates through layer-wise expert clustering, which groups functionally similar experts within each MoE layer. We validate C-Prune through extensive experiments on multiple MoE models and benchmarks.
arXiv Detail & Related papers (2025-04-10T14:46:26Z)
- Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism that determines the relevance of each expert to a given input and dynamically assigns experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
arXiv Detail & Related papers (2025-03-05T06:11:24Z)
- Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) from a theoretical perspective, focusing on the cumulative effect of reconstruction errors throughout the sparsification process. We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z)
- Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts [82.74439280067492]
Finedeep is a deep-layered fine-grained expert architecture for dense models. Our framework partitions the feed-forward neural network layers of traditional dense models into small experts. A novel routing mechanism is proposed to determine each expert's contribution.
arXiv Detail & Related papers (2025-02-18T15:09:58Z)
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Mixture of A Million Experts [1.240096657086732]
This paper introduces PEER, a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of experts.
Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off.
arXiv Detail & Related papers (2024-07-04T20:59:20Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with an outrageously large parameter count suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
- A Novel Architecture Slimming Method for Network Pruning and Knowledge Distillation [30.39128740788747]
We propose an architecture slimming method that automates the layer configuration process.
We show that our method shows significant performance gain over baselines after pruning and distillation.
Surprisingly, we find that the resulting layer-wise compression rates correspond to the layer sensitivities found by existing works.
arXiv Detail & Related papers (2022-02-21T12:45:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.