HMoE: Heterogeneous Mixture of Experts for Language Modeling
- URL: http://arxiv.org/abs/2408.10681v1
- Date: Tue, 20 Aug 2024 09:35:24 GMT
- Title: HMoE: Heterogeneous Mixture of Experts for Language Modeling
- Authors: An Wang, Xingwu Sun, Ruobing Xie, Shuaipeng Li, Jiaqi Zhu, Zhen Yang, Pinxue Zhao, J. N. Han, Zhanhui Kang, Di Wang, Naoaki Okazaki, Cheng-zhong Xu
- Abstract summary: Traditionally, Mixture of Experts (MoE) models use homogeneous experts, each with identical capacity.
We propose a novel Heterogeneous Mixture of Experts (HMoE) where experts differ in size and thus possess diverse capacities.
HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks.
- Score: 45.65121689677227
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture of Experts (MoE) offers remarkable performance and computational efficiency by selectively activating subsets of model parameters. Traditionally, MoE models use homogeneous experts, each with identical capacity. However, varying complexity in input data necessitates experts with diverse capabilities, while homogeneous MoE hinders effective expert specialization and efficient parameter utilization. In this study, we propose a novel Heterogeneous Mixture of Experts (HMoE), where experts differ in size and thus possess diverse capacities. This heterogeneity allows for more specialized experts to handle varying token complexities more effectively. To address the imbalance in expert activation, we propose a novel training objective that encourages the frequent activation of smaller experts, enhancing computational efficiency and parameter utilization. Extensive experiments demonstrate that HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks. Codes will be released upon acceptance.
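The abstract describes the two key ingredients of HMoE, experts of different sizes and a training objective that steers activation toward smaller experts, in enough detail to sketch the general shape of such a layer. The PyTorch snippet below is a minimal illustration only, not the authors' implementation (their code is unreleased at the time of this entry): the hidden sizes, the top-2 routing, and the parameter-weighted auxiliary penalty are assumptions chosen to make the idea concrete.

```python
# Minimal sketch of a heterogeneous MoE layer (illustrative; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousMoE(nn.Module):
    """Experts share the model dimension but differ in FFN hidden width (capacity)."""

    def __init__(self, d_model=512, hidden_sizes=(256, 512, 1024, 2048), top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, h), nn.GELU(), nn.Linear(h, d_model))
            for h in hidden_sizes
        )
        # Each expert's share of the total expert parameters; used below to make
        # routing to large experts "cost" more in the auxiliary loss.
        sizes = torch.tensor([sum(p.numel() for p in e.parameters())
                              for e in self.experts], dtype=torch.float)
        self.register_buffer("size_weight", sizes / sizes.sum())
        self.router = nn.Linear(d_model, len(hidden_sizes), bias=False)
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
        gate, idx = probs.topk(self.top_k, dim=-1)     # chosen experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = (idx == e)                           # (tokens, top_k)
            tok = hit.any(dim=-1)                      # tokens routed to expert e
            if tok.any():
                w = (gate * hit).sum(dim=-1, keepdim=True)[tok]
                out[tok] = out[tok] + w * expert(x[tok])
        # Assumed auxiliary objective: a load-balancing-style penalty in which each
        # expert's activation fraction is scaled by its parameter share, nudging the
        # router toward activating smaller experts more often.
        frac = torch.zeros(len(self.experts), device=x.device)
        frac.scatter_add_(0, idx.flatten(),
                          torch.ones(idx.numel(), device=x.device))
        frac = frac / idx.numel()
        aux_loss = len(self.experts) * (frac * probs.mean(0) * self.size_weight).sum()
        return out, aux_loss
```

In training, the returned `aux_loss` would be added with a small coefficient to the language-modeling loss, e.g. `y, aux = HeterogeneousMoE()(torch.randn(16, 512))`.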
Related papers
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts [75.85448576746373]
We propose a method of grouping and pruning similar experts to improve the model's parameter efficiency.
We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures.
The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks.
arXiv Detail & Related papers (2024-07-12T17:25:02Z)
- Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models [33.834215393960605]
We introduce the Dynamic Mixture of Experts (DynMoE) technique to enhance the efficiency of training and inference for Transformer-based foundational models.
DynMoE incorporates a novel gating method that enables each token to automatically determine the number of experts to activate (a schematic sketch of this style of gating appears after this list).
Our results demonstrate that DynMoE achieves performance competitive with GMoE on vision and language tasks and with MoE-LLaVA on vision-language tasks.
arXiv Detail & Related papers (2024-05-23T08:18:30Z)
- Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens.
MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
- Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number, or even just one expert, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
arXiv Detail & Related papers (2024-03-26T05:48:02Z)
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
- HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts [25.504602853436047]
Mixture of Experts (MoE) for language models has been proven effective in augmenting the capacity of models by dynamically routing each input token to a specific subset of experts for processing.
We propose HyperMoE, a novel MoE framework built upon Hypernetworks.
This framework integrates the computational processes of MoE with the concept of knowledge transferring in multi-task learning.
arXiv Detail & Related papers (2024-02-20T02:09:55Z)
- MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts [15.535613294871487]
We propose a method called Mixture-of-Distilled-Expert (MoDE).
MoDE applies moderate mutual distillation among experts to enable each expert to pick up more features learned by other experts.
arXiv Detail & Related papers (2024-01-31T03:52:32Z)
- Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks.
However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks.
We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low-rank experts.
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
- Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity [37.04254056062765]
Stratified Mixture of Experts (SMoE) models can assign dynamic capacity to different tokens.
We show SMoE outperforms multiple state-of-the-art MoE models with the same or fewer parameters.
arXiv Detail & Related papers (2023-05-03T15:18:18Z)
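For the DynMoE entry above, the summary mentions gating in which each token determines its own number of active experts. The sketch below shows one way such variable-k gating can be written, assuming sigmoid gate scores compared against trainable per-expert thresholds; the class name and details are illustrative assumptions, not the DynMoE paper's exact formulation.

```python
# Illustrative variable-k gate: each token activates every expert whose score
# clears a learned threshold (an assumption-based sketch, not DynMoE's code).
import torch
import torch.nn as nn

class VariableKGate(nn.Module):
    def __init__(self, d_model=512, n_experts=8):
        super().__init__()
        self.score = nn.Linear(d_model, n_experts, bias=False)
        self.threshold = nn.Parameter(torch.zeros(n_experts))   # one per expert

    def forward(self, x):                                    # x: (tokens, d_model)
        scores = torch.sigmoid(self.score(x))                # (tokens, n_experts)
        active = scores > torch.sigmoid(self.threshold)      # per-token routing mask
        # Fall back to the single best expert if no score clears the threshold.
        none = ~active.any(dim=-1)
        active[none, scores[none].argmax(dim=-1)] = True
        weights = scores * active                             # zero out inactive experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return active, weights                                # effective k varies per token
```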
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.