Related papers: Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models

Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models

URL: http://arxiv.org/abs/2508.01261v1
Date: Sat, 02 Aug 2025 08:33:30 GMT
Title: Unifying Mixture of Experts and Multi-Head Latent Attention for Efficient Language Models
Authors: Sushant Mehta, Raj Dandekar, Rajat Dandekar, Sreedath Panat,
Abstract summary: MoE-MLA-RoPE is a novel architecture combination that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient language modeling.<n>Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations.
Score: 1.7272658301768147
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present MoE-MLA-RoPE, a novel architecture combination that combines Mixture of Experts (MoE) with Multi-head Latent Attention (MLA) and Rotary Position Embeddings (RoPE) for efficient language modeling. Our approach addresses the fundamental trade-off between model capacity and computational efficiency through three key innovations: (1) fine-grained expert routing with 64 micro-experts and top-$k$ selection, enabling flexible specialization through 3.6 * 10^7 possible expert combinations; (2) shared expert isolation that dedicates 2 always active experts for common patterns while routing to 6 of 62 specialized experts; and (3) gradient-conflict-free load balancing that maintains expert utilization without interfering with primary loss optimization. Extensive experiments on models ranging from 17M to 202M parameters demonstrate that MoE-MLA-RoPE with compression ratio r=d/2 achieves 68% KV cache memory reduction and 3.2x inference speedup while maintaining competitive perplexity (0.8% degradation). Compared to the parameters with 53.9M parameters, MoE-MLA-RoPE improves the validation loss by 6.9% over the vanilla transformers while using 42% fewer active parameters per forward pass. FLOP-matched experiments reveal even larger gains: 11.1% improvement with 3.2x inference acceleration. Automated evaluation using GPT-4 as a judge confirms quality improvements in generation, with higher scores on coherence (8.1/10), creativity (7.9/10) and grammatical correctness (8.2/10). Our results establish that architectural novelty, not parameter scaling, defines the efficiency frontier for resource-constrained language model deployment.

Related papers

Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts [0.0]
Latent Prototype Routing (LPR) is a novel routing framework that promotes balanced expert utilization without compromising downstream performance.<n>LPR reduces the Gini coefficient of expert load from 0.70 to 0.035 on average, improves the min-max expert load ratio from 1e-6 to 0.70, achieving near-perfect load balancing.
arXiv Detail & Related papers (2025-06-26T14:41:18Z)
LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing [17.171872354057694]
We propose the LoRA-Mixer, a modular and lightweight MoE framework that integrates LoRA experts.<n>Our core innovation lies in replacing the projection matrices of the attention module's input/output linear layers with task-specific LoRA experts.<n>LoRA-Mixer achieves significant improvements on datasets such as GSM8K, HumanEval, and MedQA.
arXiv Detail & Related papers (2025-06-17T14:58:54Z)
EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
DynMoLE: Boosting Mixture of LoRA Experts Fine-Tuning with a Hybrid Routing Mechanism [5.988126768890861]
DynMoLE is a hybrid routing strategy that dynamically adjusts expert selection based on the Tsallis entropy of the router's probability distribution.<n>Our experiments on commonsense reasoning benchmarks demonstrate that DynMoLE achieves substantial performance improvements.
arXiv Detail & Related papers (2025-04-01T11:14:19Z)
Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer [5.585222292493927]
We propose Union-of-Experts (UoE), which decomposes transformer into an equitant group of experts, and then implement selective routing on input data and experts.<n>Experiments demonstrate that UoE model surpass Full Attention, state-of-art MoEs and efficient transformers.
arXiv Detail & Related papers (2025-03-04T11:01:25Z)
Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs)<n>Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss.<n>For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts [95.26323548734692]
MoMa is a modality-aware mixture-of-experts architecture for pre-training mixed-modal, early-fusion language models. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings.
arXiv Detail & Related papers (2024-07-31T17:46:51Z)
Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection [16.062265609569003]
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs)<n>We propose a novel expert routing framework that incorporates: (1) An efficient routing mechanism with lightweight computation; (2) An adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) A module that determines the lower bounds of expert capacity based on dynamic token distribution analysis.
arXiv Detail & Related papers (2024-05-24T02:50:44Z)
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems [63.713297451300086]
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B. Their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system.
arXiv Detail & Related papers (2022-06-15T20:44:23Z)
ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention [48.697458429460184]
Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer. This paper proposes a well-designed model named ERNIE-Sparse. It consists of two distinctive parts: (i) Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance for transformers with different attention topologies.
arXiv Detail & Related papers (2022-03-23T08:47:01Z)
Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline. We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures. Our best end-to-end model based on RNN-Transducer, together with improved beam search, reaches quality by only 3.8% WER abs. worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.