Related papers: Multi-Head Mixture-of-Experts

Multi-Head Mixture-of-Experts

URL: http://arxiv.org/abs/2404.15045v1
Date: Tue, 23 Apr 2024 13:47:09 GMT
Title: Multi-Head Mixture-of-Experts
Authors: Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei,
Abstract summary: We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
Score: 100.60556163597946
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) Low expert activation, where only a small subset of experts are activated for optimization. (2) Lacking fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhances expert activation, thus deepens context understanding and alleviate overfitting. Moreover, our MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks: English-focused language modeling, Multi-lingual language modeling and Masked multi-modality modeling tasks, demonstrate the effectiveness of MH-MoE.

Related papers

Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models.<n>SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs.<n>Recent Mixture of Experts (MoE) approaches attempt to address this by SAEs into narrower expert networks with gated activation.<n>We propose two key innovations: (1) Multiple Expert Activation that simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling that enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z)
Training Matryoshka Mixture-of-Experts for Elastic Inference-Time Expert Utilization [60.309915093470416]
Matryoshka MoE (M-MoE) is a training framework that instills a coarse-to-fine structure directly into the expert ensemble.<n>Our work paves the way for more practical and adaptable deployments of large-scale MoE models.
arXiv Detail & Related papers (2025-09-30T16:56:44Z)
AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models [15.495054753492584]
We propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups.<n>We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding.<n>AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.
arXiv Detail & Related papers (2025-09-16T06:16:05Z)
EvoMoE: Expert Evolution in Mixture of Experts for Multimodal Large Language Models [25.12002287083368]
Multi-modal large language models (MLLMs) have increasingly adopted MoE techniques.<n>MoE experts are often by simply replicating the FFN parameters from LLMs.<n>Expert uniformity occurs because MoE experts are often by simply replicating the FFN parameters from LLMs.<n> router rigidity stems from the prevalent use of static linear routers for expert selection.
arXiv Detail & Related papers (2025-05-28T08:38:39Z)
Unveiling Hidden Collaboration within Mixture-of-Experts in Large Language Models [5.211806751260724]
We propose a hierarchical sparse dictionary learning (HSDL) method that uncovers the collaboration patterns among experts. We also introduce the Contribution-Aware Expert Pruning (CAEP) algorithm, which effectively prunes low-contribution experts.
arXiv Detail & Related papers (2025-04-16T04:06:15Z)
TAMP: Token-Adaptive Layerwise Pruning in Multimodal Large Language Models [23.916205754112774]
Multimodal Large Language Models (MLLMs) have shown remarkable versatility in understanding diverse multimodal data and tasks. We propose TAMP, a simple yet effective pruning framework tailored for MLLMs. We validate our method on two state-of-the-art MLLMs: LLaVA-NeXT, designed for vision-language tasks, and VideoLLaMA2, capable of processing audio, visual, and language modalities.
arXiv Detail & Related papers (2025-04-14T05:44:38Z)
MH-MoE: Multi-Head Mixture-of-Experts [119.47867308669764]
Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts.<n>We present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models.
arXiv Detail & Related papers (2024-11-25T09:05:36Z)
Retraining-Free Merging of Sparse Mixture-of-Experts via Hierarchical Clustering [14.858134039539697]
We propose Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE) HC-SMoE is a task-agnostic expert merging framework that reduces SMoE model parameters without retraining. We validate our approach through extensive experiments on eight zero-shot language tasks and demonstrate its effectiveness in large-scale SMoE models such as Qwen and Mixtral.
arXiv Detail & Related papers (2024-10-11T07:36:14Z)
EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
HMoE: Heterogeneous Mixture of Experts for Language Modeling [45.65121689677227]
Traditionally, Mixture of Experts (MoE) models use homogeneous experts, each with identical capacity. We propose a novel Heterogeneous Mixture of Experts (HMoE) where experts differ in size and thus possess diverse capacities. HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks.
arXiv Detail & Related papers (2024-08-20T09:35:24Z)
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications. MoE has been emerged as a promising solution with its sparse architecture for effective task decoupling. Intuition-MoR1E achieves superior efficiency and 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts [74.40198929049959]
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to mix many multimodal low rank experts.
arXiv Detail & Related papers (2023-12-01T23:04:27Z)
Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning. In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity. We propose an alternating training strategy that encourages each expert to update in a direction to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.