Mixture of Lookup Key-Value Experts
- URL: http://arxiv.org/abs/2512.09723v1
- Date: Wed, 10 Dec 2025 15:05:55 GMT
- Title: Mixture of Lookup Key-Value Experts
- Authors: Zongcheng Wang
- Abstract summary: We present the Mixture of Lookup Key-Value Experts (MoLKV) model. MoLKV achieves significantly lower validation loss in small-scale evaluations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent research has developed several LLM architectures suitable for inference on end-user devices, such as the Mixture of Lookup Experts (MoLE)~\parencite{jie_mixture_2025}. A key feature of MoLE is that each token id is associated with a dedicated group of experts. For a given input, only the experts corresponding to the input token id will be activated. Since the communication overhead of loading this small number of activated experts into RAM during inference is negligible, expert parameters can be offloaded to storage, making MoLE suitable for resource-constrained devices. However, MoLE's context-independent expert selection mechanism, based solely on input ids, may limit model performance. To address this, we propose the \textbf{M}ixture \textbf{o}f \textbf{L}ookup \textbf{K}ey-\textbf{V}alue Experts (\textbf{MoLKV}) model. In MoLKV, each expert is structured as a key-value pair. For a given input, the input-derived query interacts with the cached key-value experts from the current sequence, generating a context-aware expert output. This context-aware mechanism alleviates the limitation of MoLE, and experimental results demonstrate that MoLKV achieves significantly lower validation loss in small-scale evaluations.
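To make the mechanism concrete, below is a minimal, illustrative PyTorch-style sketch of the idea described in the abstract: each token id looks up a dedicated key-value expert (mirroring MoLE's id-based lookup), and the input-derived query attends over the key-value experts cached for the current sequence to produce a context-aware output. All names, shapes, and the specific dot-product-attention aggregation are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLKVSketch(nn.Module):
    """Illustrative sketch of the MoLKV idea (assumptions, not the paper's code).

    Each token id indexes one key-value expert (a lookup, as in MoLE), so only
    the experts for ids occurring in the sequence need to be loaded from
    storage. The query derived from the current hidden state then attends over
    the key-value experts cached for the sequence so far, yielding a
    context-aware expert output instead of a purely id-based one.
    """

    def __init__(self, vocab_size: int, d_model: int, d_key: int):
        super().__init__()
        # One key and one value vector per token id (the "lookup" experts).
        self.expert_keys = nn.Embedding(vocab_size, d_key)
        self.expert_values = nn.Embedding(vocab_size, d_model)
        # The query is derived from the hidden state at each position.
        self.query_proj = nn.Linear(d_model, d_key)

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) ids of the current sequence
        # hidden:    (batch, seq_len, d_model) hidden states
        keys = self.expert_keys(token_ids)       # (batch, seq_len, d_key)
        values = self.expert_values(token_ids)   # (batch, seq_len, d_model)
        queries = self.query_proj(hidden)        # (batch, seq_len, d_key)

        # Causal attention over the cached key-value experts, so each position
        # only uses experts of the current and earlier tokens.
        scores = queries @ keys.transpose(-1, -2) / keys.shape[-1] ** 0.5
        seq_len = token_ids.shape[-1]
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                       device=token_ids.device))
        scores = scores.masked_fill(~causal, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        return weights @ values                   # context-aware expert output
```

Under this reading, the per-id key and value embeddings play the role of the offloadable experts: only rows for ids that actually appear in the sequence need to be fetched into RAM, while attending over the cached keys supplies the context awareness that a pure id-based lookup lacks.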
Related papers
- Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging [17.490596264046435]
Sub-MoE is a novel MoE compression framework via Subspace Expert Merging. Our key insight is to perform joint Singular Value Decomposition (SVD) on expert weights. Our Sub-MoE significantly outperforms existing expert pruning and merging methods.
arXiv Detail & Related papers (2025-06-29T14:43:50Z) - Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations [48.890534958441016]
This study investigates domain specialization and expert redundancy in large-scale MoE models. We propose a simple yet effective pruning framework, EASY-EP, to identify and retain only the most relevant experts. Experiments on DeepSeek-R1 and DeepSeek-V3-0324 show that our method can achieve comparable performance and $2.99\times$ throughput under the same memory budget as the full model, using only half the experts.
arXiv Detail & Related papers (2025-04-09T11:34:06Z) - Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z) - ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
The Mixture-of-Experts (MoE) Transformer, the backbone of several prominent language models, leverages sparsity by activating only a fraction of model parameters for each input token. We introduce ResMoE, an innovative MoE approximation framework that utilizes the Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z) - Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework for advancing the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism, which determines the relevance of each expert to a given input and dynamically assigns experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under MoE equipped with the standard softmax gating or its variants, including dense-to-sparse gating and hierarchical softmax gating.
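For reference, the "standard softmax gating" referred to here can be written, in commonly used (assumed) notation with gating parameters $w_1, \dots, w_N$ and experts $f_1, \dots, f_N$, as
$$ g_i(x) = \frac{\exp(w_i^\top x)}{\sum_{j=1}^{N} \exp(w_j^\top x)}, \qquad y(x) = \sum_{i=1}^{N} g_i(x)\, f_i(x), $$
so each expert's output is weighted by its input-dependent relevance; the dense-to-sparse and hierarchical variants change how these weights are computed or sparsified.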
arXiv Detail & Related papers (2025-03-05T06:11:24Z) - Mixture of Tunable Experts - Behavior Modification of DeepSeek-R1 at Inference Time [1.1655046053160683]
We present Mixture of Tunable Experts (MoTE), a method that extends the Mixture-of-Experts architecture of Large Language Models (LLMs). MoTE enables meaningful and focused behavior changes in LLMs on the fly at inference time.
arXiv Detail & Related papers (2025-02-16T12:24:39Z) - AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models [14.646419975663367]
We introduce AdaMoE to realize token-adaptive routing for MoE.
AdaMoE does not force each token to occupy a fixed number of null experts.
It can reduce average expert load (FLOPs) while achieving superior performance.
arXiv Detail & Related papers (2024-06-19T05:47:10Z) - Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number of experts, or even just one, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
arXiv Detail & Related papers (2024-03-26T05:48:02Z) - MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely activated Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts and utilizes a gated routing network to activate experts conditionally.
However, as the number of experts grows, MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)