Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts
- URL: http://arxiv.org/abs/2505.22582v1
- Date: Wed, 28 May 2025 16:54:53 GMT
- Title: Less, but Better: Efficient Multilingual Expansion for LLMs via Layer-wise Mixture-of-Experts
- Authors: Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
- Abstract summary: We propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting.
- Score: 98.73585104789217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continually expanding new languages for existing large language models (LLMs) is a promising yet challenging approach to building powerful multilingual LLMs. The biggest challenge is to make the model continuously learn new languages while preserving proficiency in the old languages. To achieve this, recent work utilizes the Mixture-of-Experts (MoE) architecture to expand new languages by adding new experts and avoid catastrophic forgetting of old languages by routing corresponding tokens to the original model backbone (old experts). Although intuitive, this kind of method is parameter-costly when expanding new languages and still inevitably impacts the performance of old languages. To address these limitations, we analyze the language characteristics of different layers in LLMs and propose a layer-wise expert allocation algorithm (LayerMoE) to determine the appropriate number of new experts for each layer. Specifically, we find that different layers in LLMs exhibit different representation similarities between languages and then utilize the similarity as the indicator to allocate experts for each layer, i.e., the higher the similarity, the fewer the experts. Additionally, to further mitigate the forgetting of old languages, we add a classifier in front of the router network on the layers with higher similarity to guide the routing of old language tokens. Experimental results show that our method outperforms the previous state-of-the-art baseline with 60% fewer experts in the single-expansion setting and with 33.3% fewer experts in the lifelong-expansion setting, demonstrating the effectiveness of our method.
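To make the allocation rule above concrete, here is a minimal sketch of similarity-guided, layer-wise expert allocation. The cosine similarity over mean-pooled hidden states and the inverse-proportional budget split are illustrative assumptions rather than the paper's exact formulation, and the classifier placed in front of the router is omitted.

```python
# Minimal sketch of the layer-wise allocation idea described in the abstract:
# measure how similarly a layer represents old and new languages, then give
# layers with LOWER similarity MORE of the new-expert budget. The similarity
# metric and the inverse-proportional split are illustrative assumptions.
import numpy as np

def layer_similarity(old_hidden: np.ndarray, new_hidden: np.ndarray) -> float:
    """Cosine similarity between mean-pooled hidden states ([tokens, dim])
    of an old-language corpus and a new-language corpus at one layer."""
    a, b = old_hidden.mean(axis=0), new_hidden.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def allocate_experts(similarities, total_new_experts):
    """Split a fixed budget of new experts across layers: the higher a layer's
    cross-language similarity, the fewer new experts it receives."""
    sims = np.asarray(similarities, dtype=np.float64)
    weights = np.clip(1.0 - sims, 1e-6, None)   # dissimilarity drives allocation
    weights /= weights.sum()
    alloc = np.floor(weights * total_new_experts).astype(int)
    # Hand any leftover experts to the most dissimilar layers first.
    leftover = total_new_experts - int(alloc.sum())
    for idx in np.argsort(-weights)[:leftover]:
        alloc[idx] += 1
    return alloc

# Example: 4 layers with the measured similarities below and a budget of 8 experts.
print(allocate_experts([0.92, 0.85, 0.40, 0.30], total_new_experts=8))
# -> [0 0 4 4]: most of the new capacity goes to the least-similar layers.
```

Concentrating the budget on low-similarity layers follows the abstract's observation that high-similarity layers can share capacity between old and new languages, so they need little or no added width.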
Related papers
- Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model [38.0723521889505]
Experimental results on 18 to 128 languages show that our method reduces the negative transfer between languages and significantly boosts multilingual performance with fewer parameters.
Such language-group specialization of experts benefits new language adaptation and reduces the impact on previously learned multilingual knowledge.
arXiv Detail & Related papers (2025-06-14T07:56:18Z)
- LayAlign: Enhancing Multilingual Reasoning in Large Language Models via Layer-Wise Adaptive Fusion and Alignment Strategy [33.85811169010525]
Large language models (LLMs) exhibit suboptimal performance on low-resource languages.
Recent approaches have leveraged multilingual encoders alongside LLMs by introducing trainable parameters connecting the two models.
We propose LayAlign, a framework that integrates representations from all encoder layers.
arXiv Detail & Related papers (2025-02-17T03:45:03Z)
- LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision.
LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks.
We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z)
- Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts [29.091853631327304]
Adapting medical Large Language Models to local languages can reduce barriers to accessing healthcare services.
We first construct a high-quality medical dataset and conduct analysis to ensure its quality.
We propose a novel MoE routing method that employs language-specific experts and cross-lingual routing.
arXiv Detail & Related papers (2024-10-14T15:31:54Z)
- MoE-LPR: Multilingual Extension of Large Language Models through Mixture-of-Experts with Language Priors Routing [78.62611800987817]
Large Language Models (LLMs) are often English-centric due to the disproportionate distribution of languages in their pre-training data.
We propose a method called MoE-LPR (Mixture-of-Experts with Language Priors Routing) to enhance multilingual capability.
arXiv Detail & Related papers (2024-08-21T07:43:49Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, i.e., be crosslingual?
This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism.
LLMs initially understand the query, converting multilingual inputs into English for task-solving.
In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z)
- Unraveling Babel: Exploring Multilingual Activation Patterns of LLMs and Their Applications [24.18102112644796]
We study the internal neuron activation patterns of large language models (LLMs) when processing different languages.
We leverage the discovered differences in expert activation frequencies to guide sparse activation and pruning.
Our findings offer new perspectives for applications such as sparse activation and model pruning; a brief illustrative sketch of this idea follows the related-papers list below.
arXiv Detail & Related papers (2024-02-26T07:44:56Z)
- How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages.
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z)
- Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
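The "Unraveling Babel" entry above mentions using per-language expert activation frequencies to guide sparse activation and pruning; the sketch below illustrates that idea. Counting top-1 router choices per language and keeping only the most frequently activated experts is an assumed reading for illustration, not the paper's exact procedure.

```python
# Minimal sketch: record how often each expert fires for tokens of a given
# language, then keep only the most frequently activated experts for sparse
# activation or pruning. The top-1 counting and the fixed keep ratio are
# assumed details for illustration only.
import numpy as np

def activation_frequencies(router_choices, num_experts: int) -> np.ndarray:
    """Normalized frequency of each expert index in a trace of router
    decisions collected while processing one language."""
    counts = np.bincount(np.asarray(router_choices), minlength=num_experts)
    return counts / max(int(counts.sum()), 1)

def select_active_experts(freq: np.ndarray, keep_ratio: float = 0.25) -> set:
    """Keep the most frequently activated experts; the rest could be skipped
    at inference time or pruned from the checkpoint."""
    k = max(1, int(round(keep_ratio * len(freq))))
    return set(np.argsort(-freq)[:k].tolist())

# Example with 8 experts and toy router traces for two languages.
traces = {"en": [0, 0, 1, 0, 2, 0, 1, 0, 3, 0], "zh": [5, 6, 5, 5, 7, 6, 5, 5, 6, 4]}
for lang, trace in traces.items():
    freq = activation_frequencies(trace, num_experts=8)
    print(lang, select_active_experts(freq))  # en -> {0, 1}, zh -> {5, 6}
```

Computed per language, such masks indicate which experts specialize in which languages, which is the observation that entry ties to sparse activation and model pruning.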
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.