Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering
- URL: http://arxiv.org/abs/2601.14050v1
- Date: Tue, 20 Jan 2026 15:04:25 GMT
- Title: Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering
- Authors: Yuxin Chen, Zhengzhou Cai, Xiangtian Ji, Weixiang Zhao, An Zhang, Xiang Wang, Tat-Seng Chua
- Abstract summary: We propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time. Our code is available at https://github.com/conctsai/Multilingualism-in-Mixture-of-Experts-LLMs.
- Score: 61.0787902713059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) architectures have shown strong multilingual capabilities, yet the internal mechanisms underlying performance gains and cross-language differences remain insufficiently understood. In this work, we conduct a systematic analysis of MoE models, examining routing behavior and expert specialization across languages and network depth. Our analysis reveals that multilingual processing in MoE models is highly structured: routing aligns with linguistic families, expert utilization follows a clear layerwise pattern, and high-resource languages rely on shared experts while low-resource languages depend more on language-exclusive experts despite weaker performance. Layerwise interventions further show that early and late MoE layers support language-specific processing, whereas middle layers serve as language-agnostic capacity hubs. Building on these insights, we propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time, leading to consistent multilingual performance improvements, particularly for linguistically related language pairs. Our code is available at https://github.com/conctsai/Multilingualism-in-Mixture-of-Experts-LLMs.
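As a rough illustration of the inference-time intervention the abstract describes, the PyTorch-style sketch below adds a bias to middle-layer router logits in favor of a pre-identified set of shared experts before top-k selection. The shared-expert indices, the choice of which layers count as "middle", and the steering strength `alpha` are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of routing-guided steering (hypothetical, not the paper's code).
# Assumes a standard top-k MoE router that produces per-token logits over experts.
import torch


def steered_router_probs(
    router_logits: torch.Tensor,   # [num_tokens, num_experts]
    shared_expert_ids: list[int],  # experts tied to the dominant language (assumed known)
    layer_idx: int,
    middle_layers: range,          # e.g. range(8, 20); which layers are "middle" is an assumption
    alpha: float = 1.0,            # steering strength (hypothetical hyperparameter)
    top_k: int = 2,
):
    """Bias middle-layer routing toward shared experts; leave other layers untouched."""
    logits = router_logits.clone()
    if layer_idx in middle_layers:
        logits[:, shared_expert_ids] += alpha  # additive bias toward shared experts
    probs = torch.softmax(logits, dim=-1)
    topk_vals, topk_ids = probs.topk(top_k, dim=-1)
    # Renormalize the selected experts' weights so they sum to 1 per token.
    topk_vals = topk_vals / topk_vals.sum(dim=-1, keepdim=True)
    return topk_ids, topk_vals
```

Because the bias is applied only when `layer_idx` falls in the middle-layer range, early and late layers keep their original language-specific routing, which matches the layerwise picture the abstract reports.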
Related papers
- Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting [69.6938830307759]
Lamer-SSL is a parameter-efficient framework that integrates a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy. Experiments on automatic speech recognition (ASR) and language identification (LID) demonstrate that Lamer-SSL extends self-supervised models to new languages effectively.
arXiv Detail & Related papers (2026-02-13T09:22:22Z) - Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders [51.380449540006985]
Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance still favor the dominant training language? We analyze their internal mechanisms using cross-layer transcoders (CLT) and attribution graphs.
arXiv Detail & Related papers (2025-11-13T22:51:06Z) - Multilingual Routing in Mixture-of-Experts [45.90403983668531]
We analyze expert routing patterns using parallel multilingual datasets (a sketch of this style of analysis appears after this list). We find that MoE models route tokens in language-specific ways in the early and late decoder layers. We introduce a method that steers the router by promoting middle-layer task experts frequently activated in English.
arXiv Detail & Related papers (2025-10-06T11:09:20Z) - Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts [98.73585104789217]
We propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting.
arXiv Detail & Related papers (2025-05-28T16:54:53Z) - LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy [33.85811169010525]
Large language models (LLMs) exhibit suboptimal performance on low-resource languages. Recent approaches have leveraged multilingual encoders alongside LLMs by introducing trainable parameters connecting the two models. We propose LayAlign, a framework that integrates representations from all encoder layers.
arXiv Detail & Related papers (2025-02-17T03:45:03Z) - LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z) - Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts [29.091853631327304]
Adapting medical Large Language Models to local languages can reduce barriers to accessing healthcare services. We first construct a high-quality medical dataset and conduct analysis to ensure its quality. We propose a novel MoE routing method that employs language-specific experts and cross-lingual routing.
arXiv Detail & Related papers (2024-10-14T15:31:54Z) - Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model [12.030995417911296]
This study proposes Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups.
Within each language expert group, a gating network operates unsupervised to foster collaboration on attributes beyond language.
Our method preserves the efficient inference capabilities characteristic of MoE models without necessitating additional pre-training.
arXiv Detail & Related papers (2024-09-03T16:53:38Z) - Understanding the role of FFNs in driving multilingual behaviour in LLMs [0.0]
In this paper, we conduct an in-depth analysis of the multilingual capabilities of a family of Large Language Models.
We introduce novel metrics to probe the model's multilingual behaviour at different layers and shed light on the impact of architectural choices on multilingual processing.
arXiv Detail & Related papers (2024-04-22T03:47:00Z)
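As referenced in the "Multilingual Routing in Mixture-of-Experts" entry above, routing-pattern analyses of this kind typically aggregate which experts a parallel multilingual corpus activates per language. The sketch below is a hypothetical helper rather than any paper's released code: it counts per-language expert activation frequencies and separates experts used by every language ("shared") from those used by only one ("language-exclusive"); the activation threshold is an illustrative choice.

```python
# Hypothetical routing-analysis helper (not released code from any paper above).
import torch


def expert_usage_by_language(router_topk_ids: dict[str, torch.Tensor], num_experts: int):
    """router_topk_ids maps language code -> LongTensor of selected expert ids (any shape)."""
    usage = {}
    for lang, ids in router_topk_ids.items():
        counts = torch.bincount(ids.flatten(), minlength=num_experts).float()
        usage[lang] = counts / counts.sum()  # activation frequency distribution over experts
    return usage


def shared_and_exclusive_experts(usage: dict[str, torch.Tensor], threshold: float = 0.01):
    """Experts used above `threshold` by all languages are 'shared'; experts used by a
    single language only are 'exclusive' to it. The threshold value is illustrative."""
    langs = list(usage)
    active = {lang: set((usage[lang] > threshold).nonzero().flatten().tolist()) for lang in langs}
    shared = set.intersection(*active.values()) if langs else set()
    exclusive = {}
    for lang in langs:
        others = set.union(*(active[l] for l in langs if l != lang)) if len(langs) > 1 else set()
        exclusive[lang] = active[lang] - others
    return shared, exclusive
```

Applied per MoE layer, statistics of this form are what let the abstract above distinguish shared-expert reliance in high-resource languages from language-exclusive experts in low-resource ones.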