Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts
- URL: http://arxiv.org/abs/2601.02144v1
- Date: Mon, 05 Jan 2026 14:16:11 GMT
- Title: Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts
- Authors: Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang,
- Abstract summary: Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. We introduce kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.
- Score: 32.65737144630759
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the aggregate similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show kNN-MoE outperforms zero-shot baselines and rivals computationally expensive supervised fine-tuning.
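As a rough illustration of the mechanism the abstract describes, the sketch below blends routing logits retrieved from a memory of past cases with the frozen router's logits, gated by the aggregate similarity of the retrieved neighbors. The function and variable names, the cosine-similarity choice, and the specific gating formula are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def knn_moe_route(h, router_logits, memory_keys, memory_logits, k=4, tau=1.0):
    """Sketch of kNN-MoE-style routing: blend frozen-router logits with
    logits retrieved from a memory of past routing cases, weighted by
    neighbor similarity (illustrative, not the paper's exact method)."""
    # cosine similarity between token representation h and stored keys
    sims = memory_keys @ h / (
        np.linalg.norm(memory_keys, axis=1) * np.linalg.norm(h) + 1e-9
    )
    idx = np.argsort(sims)[-k:]          # indices of the k nearest neighbors
    w = np.exp(sims[idx] / tau)
    w /= w.sum()                          # softmax over neighbor similarities
    retrieved = w @ memory_logits[idx]    # similarity-weighted stored logits
    # confidence gate: low aggregate similarity falls back to the frozen router
    lam = float(np.clip(sims[idx].mean(), 0.0, 1.0))
    return lam * retrieved + (1.0 - lam) * router_logits
```

When no relevant neighbors exist (aggregate similarity near zero), `lam` vanishes and the output reduces to the frozen router's logits, matching the fallback behavior the abstract describes.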
Related papers
- Routing Manifold Alignment Improves Generalization of Mixture-of-Experts LLMs [24.791817951102487]
We show that aligning the manifold of routing weights with that of task embedding can effectively reduce the gap. In experiments, we finetune routers in OLMoE, DeepSeekMoE, and Qwen3-MoE using RoMA.
arXiv Detail & Related papers (2025-11-10T18:59:53Z) - Robust Nearest Neighbour Retrieval Using Targeted Manifold Manipulation [0.0]
Nearest-neighbour retrieval is central to classification and explainable-AI pipelines. We propose Targeted Manifold Manipulation-Nearest Neighbour (TMM-NN), which reconceptualises retrieval by assessing how readily each sample can be nudged into a designated region of the feature manifold. TMM-NN implements this through a lightweight, query-specific trigger patch.
arXiv Detail & Related papers (2025-11-09T07:37:05Z) - Rewiring Experts on the Fly: Continuous Rerouting for Better Online Adaptation in Mixture-of-Expert models [52.502867924372275]
Mixture-of-Experts (MoE) models achieve efficient scaling through sparse expert activation, but often suffer from suboptimal routing decisions due to distribution shifts in deployment. We propose a data-free, online test-time framework that continuously adapts MoE routing decisions during text generation without external supervision or data.
arXiv Detail & Related papers (2025-10-16T16:24:36Z) - From Score Distributions to Balance: Plug-and-Play Mixture-of-Experts Routing [52.01745035243826]
Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts. Conditional routing shifts the burden onto inference memory, limiting the number of experts per device. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy.
arXiv Detail & Related papers (2025-09-29T16:29:17Z) - RouteMark: A Fingerprint for Intellectual Property Attribution in Routing-based Model Merging [69.2230254959204]
We propose RouteMark, a framework for IP protection in merged MoE models. Our key insight is that task-specific experts exhibit stable and distinctive routing behaviors under probing inputs. For attribution and tampering detection, we introduce a similarity-based matching algorithm.
arXiv Detail & Related papers (2025-08-03T14:51:58Z) - MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts [29.11217299899888]
MoQAE is a mixed-precision quantization method via a mixture of quantization-aware experts. We show that MoQAE outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-09T08:16:24Z) - Tight Clusters Make Specialized Experts [1.7597562616011944]
Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. We present a novel router that learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement.
arXiv Detail & Related papers (2025-02-21T09:10:54Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Tractable Bounding of Counterfactual Queries by Knowledge Compilation [51.47174989680976]
We discuss the problem of bounding partially identifiable queries, such as counterfactuals, in Pearlian structural causal models.
A recently proposed iterated EM scheme yields an inner approximation of those bounds by sampling the initialisation parameters.
We show how a single symbolic knowledge compilation allows us to obtain the circuit structure with symbolic parameters to be replaced by their actual values.
arXiv Detail & Related papers (2023-10-05T07:10:40Z) - Sparse Backpropagation for MoE Training [118.31785160874024]
We introduce SparseMixer, a scalable gradient estimator that bridges the gap between backpropagation and sparse expert routing.
Grounded in a numerical ODE framework, SparseMixer harnesses the mid-point method, a second-order ODE solver, to deliver precise gradient approximations.
Applying SparseMixer to Switch Transformer on both pre-training and machine translation tasks, SparseMixer showcases considerable performance gain.
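SparseMixer grounds its gradient estimator in the mid-point method, a second-order Runge-Kutta scheme for ODEs. For reference, the generic solver step it builds on looks like this (the textbook numerical step only, not the paper's estimator):

```python
def midpoint_step(f, t, y, h):
    """One mid-point (second-order Runge-Kutta) step for dy/dt = f(t, y).
    Evaluates the derivative at the interval's midpoint, giving a
    second-order-accurate update."""
    k1 = f(t, y)                                  # slope at the start
    return y + h * f(t + h / 2.0, y + (h / 2.0) * k1)  # slope at midpoint
```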
arXiv Detail & Related papers (2023-10-01T22:43:57Z) - Soft Merging of Experts with Adaptive Routing [38.962451264172856]
We introduce Soft Merging of Experts with Adaptive Routing (SMEAR)
SMEAR avoids discrete routing by using a single "merged" expert constructed via a weighted average of all of the experts' parameters.
We empirically validate that models using SMEAR outperform models that route based on metadata or learn sparse routing through gradient estimation.
arXiv Detail & Related papers (2023-06-06T15:04:31Z)
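The soft-merging idea in SMEAR can be sketched in a few lines: rather than dispatching tokens to discrete experts, average the experts' parameter tensors under the routing distribution and apply the single merged expert. This is a minimal sketch with illustrative names and shapes, not SMEAR's exact implementation:

```python
import numpy as np

def smear_forward(x, expert_weights, routing_probs):
    """Soft merging: form one expert as the routing-weighted average of all
    experts' parameters, then apply it once (illustrative sketch).
    expert_weights: (E, d_in, d_out); routing_probs: (E,), sums to 1."""
    # contract the expert axis: merged has shape (d_in, d_out)
    merged = np.tensordot(routing_probs, expert_weights, axes=1)
    return x @ merged
```

Because every expert contributes continuously through the weighted average, the routing distribution receives exact gradients, avoiding the discrete-routing gradient estimation that SMEAR is designed to sidestep.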
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.