Multilingual Routing in Mixture-of-Experts
- URL: http://arxiv.org/abs/2510.04694v1
- Date: Mon, 06 Oct 2025 11:09:20 GMT
- Title: Multilingual Routing in Mixture-of-Experts
- Authors: Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng,
- Abstract summary: We analyze expert routing patterns using parallel multilingual datasets. We find that MoE models route tokens in language-specific ways in the early and late decoder layers. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English.
- Score: 45.90403983668531
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Mixture-of-Experts (MoE) architectures have become the key to scaling modern LLMs, yet little is understood about how their sparse routing dynamics respond to multilingual data. In this work, we analyze expert routing patterns using parallel multilingual datasets and present highly interpretable layer-wise phenomena. We find that MoE models route tokens in language-specific ways in the early and late decoder layers but exhibit significant cross-lingual routing alignment in middle layers, mirroring parameter-sharing trends observed in dense LLMs. In particular, we reveal a clear, strong correlation between a model's performance in a given language and how similarly its tokens are routed to English in these layers. Extending beyond correlation, we explore inference-time interventions that induce higher cross-lingual routing alignment. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English, and it successfully increases multilingual performance. These 1-2% gains are remarkably consistent across two evaluation tasks, three models, and 15+ languages, especially given that these simple interventions override routers of extensively trained, state-of-the-art LLMs. In comparison, interventions outside of the middle layers or targeting multilingual-specialized experts only yield performance degradation. Altogether, we present numerous findings that explain how MoEs process non-English text and demonstrate that generalization is limited by the model's ability to leverage language-universal experts in all languages.
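The two technical ingredients in the abstract — measuring cross-lingual routing alignment on parallel data, and steering the router toward English-associated task experts in the middle layers — can be illustrated with a short sketch. A minimal PyTorch version follows; the usage-histogram similarity, the additive logit bias, and all names and values (`english_task_experts`, `boost`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def expert_usage(selected: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Normalized frequency of each expert over a batch of routed tokens.
    selected: (num_tokens, top_k) integer expert indices from one layer."""
    counts = torch.bincount(selected.flatten(), minlength=num_experts).float()
    return counts / counts.sum()

def routing_alignment(usage_lang: torch.Tensor, usage_en: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between two languages' expert-usage distributions
    (one plausible alignment measure; the paper's metric may differ)."""
    return F.cosine_similarity(usage_lang, usage_en, dim=0)

def steer_router_logits(router_logits: torch.Tensor,
                        promoted_experts: torch.Tensor,
                        boost: float = 1.0) -> torch.Tensor:
    """Additively promote a chosen set of experts before top-k routing."""
    steered = router_logits.clone()
    steered[:, promoted_experts] += boost
    return steered

# Toy usage inside one middle-layer MoE block (all values illustrative):
num_tokens, num_experts, top_k = 8, 64, 2
logits = torch.randn(num_tokens, num_experts)
english_task_experts = torch.tensor([3, 17, 42])  # e.g., experts most active on English task data
steered = steer_router_logits(logits, english_task_experts, boost=1.0)
weights, selected = torch.topk(torch.softmax(steered, dim=-1), k=top_k, dim=-1)
```

An intervention of this shape is cheap to apply at inference time, consistent with the abstract's point that simply overriding the trained router in the middle layers is enough to produce the reported gains.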
Related papers
- Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering [61.0787902713059]
We propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time. Our code is available at http://conctsai.com/multilingualism-in-Mixture-of-Experts-LLMs.
arXiv Detail & Related papers (2026-01-20T15:04:25Z)
- Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders [51.380449540006985]
Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance still favor the dominant training language? We analyze their internal mechanisms using cross-layer transcoders (CLT) and attribution graphs.
arXiv Detail & Related papers (2025-11-13T22:51:06Z)
- Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts [98.73585104789217]
We propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting.
arXiv Detail & Related papers (2025-05-28T16:54:53Z)
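The summary above only states that LayerMoE decides how many new experts each layer receives. As a loose illustration of one such rule — splitting a fixed budget inversely to per-layer cross-lingual representation similarity — here is a toy sketch; the heuristic, the similarity inputs, and every name in it are assumptions, not the paper's algorithm.

```python
import torch

def allocate_new_experts(layer_similarity: torch.Tensor, total_new_experts: int) -> list[int]:
    """Split a budget of new experts across layers, giving more to layers where
    cross-lingual representation similarity is low (assumed heuristic)."""
    need = 1.0 - layer_similarity            # low similarity -> high need
    share = need / need.sum()
    alloc = torch.floor(share * total_new_experts).long()
    # Hand any remainder from flooring to the neediest layers.
    remainder = total_new_experts - int(alloc.sum())
    for idx in torch.argsort(need, descending=True)[:remainder]:
        alloc[idx] += 1
    return alloc.tolist()

# Example: 8 layers, per-layer similarity between the new language and English.
sim = torch.tensor([0.9, 0.85, 0.7, 0.6, 0.6, 0.7, 0.8, 0.9])
print(allocate_new_experts(sim, total_new_experts=16))
```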
- Enhancing Non-English Capabilities of English-Centric Large Language Models through Deep Supervision Fine-Tuning [42.166438218926274]
We introduce a deep supervision fine-tuning method (DFT) that incorporates additional supervision in the internal layers of the model to guide its workflow. Our method guides the model to not only consider the final generated result when processing non-English inputs but also ensure the accuracy of internal representations.
arXiv Detail & Related papers (2025-03-03T07:59:32Z)
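A deep-supervision objective of the kind the DFT summary describes can be written as the final LM loss plus an auxiliary term on selected internal layers. The sketch below assumes the auxiliary term aligns intermediate hidden states of non-English inputs with English references via cosine distance; the actual form of DFT's internal supervision may differ.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(
    lm_loss: torch.Tensor,              # standard next-token loss on the final output
    hidden_non_en: list[torch.Tensor],  # per-layer hidden states, non-English input
    hidden_en: list[torch.Tensor],      # per-layer hidden states, English reference
    supervised_layers: list[int],       # which internal layers get extra supervision (assumed)
    alpha: float = 0.1,                 # weight of the auxiliary term (assumed)
) -> torch.Tensor:
    """Total loss = final LM loss + weighted internal-layer alignment loss."""
    aux = torch.stack([
        1.0 - F.cosine_similarity(hidden_non_en[l], hidden_en[l], dim=-1).mean()
        for l in supervised_layers
    ]).mean()
    return lm_loss + alpha * aux

# Toy example (shapes and layer choice illustrative):
lm = torch.tensor(2.3)
h_x = [torch.randn(2, 16, 512) for _ in range(12)]
h_en = [torch.randn(2, 16, 512) for _ in range(12)]
loss = deep_supervision_loss(lm, h_x, h_en, supervised_layers=[4, 6, 8])
```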
- LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy [33.85811169010525]
Large language models (LLMs) exhibit suboptimal performance on low-resource languages. Recent approaches have leveraged multilingual encoders alongside LLMs by introducing trainable parameters connecting the two models. We propose LayAlign, a framework that integrates representations from all encoder layers.
arXiv Detail & Related papers (2025-02-17T03:45:03Z)
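"Integrates representations from all encoder layers" suggests a learned mixture over layers. A minimal sketch of one standard realization — ELMo-style softmax weights over layers followed by a projection into the LLM's hidden size — is below; LayAlign's actual fusion and alignment strategy is likely richer, and all names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class LayerwiseFusion(nn.Module):
    """Learned softmax-weighted mixture of all encoder layers, projected to the
    LLM's hidden size (an illustrative stand-in for LayAlign's fusion module)."""
    def __init__(self, num_layers: int, enc_dim: int, llm_dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq, enc_dim)
        w = torch.softmax(self.layer_weights, dim=0)
        fused = torch.einsum("l,lbsd->bsd", w, layer_states)
        return self.proj(fused)

# Example: 12 encoder layers, 768-dim encoder, 4096-dim LLM hidden size.
fusion = LayerwiseFusion(num_layers=12, enc_dim=768, llm_dim=4096)
states = torch.randn(12, 2, 16, 768)
out = fusion(states)   # (2, 16, 4096), fed to the LLM as input representations
```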
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- Understanding the role of FFNs in driving multilingual behaviour in LLMs [0.0]
In this paper, we conduct an in-depth analysis of the multilingual capabilities of a family of Large Language Models. We introduce novel metrics to probe the model's multilingual behaviour at different layers and shed light on the impact of architectural choices on multilingual processing.
arXiv Detail & Related papers (2024-04-22T03:47:00Z)
- Adapting Large Language Models for Document-Level Machine Translation [46.370862171452444]
Large language models (LLMs) have significantly advanced various natural language processing (NLP) tasks. Recent research indicates that moderately-sized LLMs often outperform larger ones after task-specific fine-tuning. This study focuses on adapting LLMs for document-level machine translation (DocMT) for specific language pairs.
arXiv Detail & Related papers (2024-01-12T09:29:13Z)
- Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, peak performance is not achieved with general-purpose multilingual text encoders used off-the-shelf, but rather with variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z)