Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources
- URL: http://arxiv.org/abs/2504.04152v1
- Date: Sat, 05 Apr 2025 12:10:55 GMT
- Title: Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources
- Authors: Zihao Li, Shaoxiong Ji, Hengyu Luo, Jörg Tiedemann
- Abstract summary: Large Language Models (LLMs) exhibit significant disparities in performance across languages. Continual Pretraining (CPT) has emerged as a promising approach to address this imbalance. This study systematically evaluates 36 CPT configurations involving three multilingual base models.
- Score: 12.54580975652981
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) exhibit significant disparities in performance across languages, primarily benefiting high-resource languages while marginalizing underrepresented ones. Continual Pretraining (CPT) has emerged as a promising approach to address this imbalance, although the relative effectiveness of monolingual, bilingual, and code-augmented data strategies remains unclear. This study systematically evaluates 36 CPT configurations involving three multilingual base models, across 30+ languages categorized as altruistic, selfish, and stagnant, spanning various resource levels. Our findings reveal three major insights: (1) Bilingual CPT improves multilingual classification but often causes language mixing issues during generation. (2) Including programming code data during CPT consistently enhances multilingual classification accuracy, particularly benefiting low-resource languages, but introduces a trade-off by slightly degrading generation quality. (3) Contrary to prior work, we observe substantial deviations from language classifications according to their impact on cross-lingual transfer: Languages classified as altruistic often negatively affect related languages, selfish languages show conditional and configuration-dependent behavior, and stagnant languages demonstrate surprising adaptability under certain CPT conditions. These nuanced interactions emphasize the complexity of multilingual representation learning, underscoring the importance of systematic studies on generalizable language classification to inform future multilingual CPT strategies.
Related papers
- High-Dimensional Interlingual Representations of Large Language Models [65.77317753001954]
Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs. We explore 31 diverse languages varying in their resource levels, typologies, and geographical regions. We find that multilingual LLMs exhibit inconsistent cross-lingual alignments.
arXiv Detail & Related papers (2025-03-14T10:39:27Z) - Assessing Large Language Models in Agentic Multilingual National Bias [31.67058518564021]
Cross-language disparities in reasoning-based recommendations remain largely unexplored. This study is the first to address this gap. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages.
arXiv Detail & Related papers (2025-02-25T08:07:42Z) - How does a Multilingual LM Handle Multiple Languages? [0.0]
This study critically examines capabilities in multilingual understanding, semantic representation, and cross-lingual knowledge transfer. It assesses semantic similarity by analyzing multilingual word embeddings for consistency using cosine similarity. It examines BLOOM-1.7B and Qwen2 through Named Entity Recognition and sentence similarity tasks to understand their linguistic structures.
arXiv Detail & Related papers (2025-02-06T18:08:14Z) - Guardians of Discourse: Evaluating LLMs on Multilingual Offensive Language Detection [10.129235204880443]
We evaluate the impact of different prompt languages and augmented translation data for the task in non-English contexts.
We discuss the impact of the inherent bias in LLMs and the datasets in the mispredictions related to sensitive topics.
arXiv Detail & Related papers (2024-10-21T04:08:16Z) - Exploring the Maze of Multilingual Modeling [2.0849578298972835]
We present a comprehensive evaluation of three popular multilingual language models: mBERT, XLM-R, and GPT-3.
Our findings reveal that while the amount of language-specific pretraining data plays a crucial role in model performance, we also identify other factors such as general resource availability, language family, and script type, as important features.
arXiv Detail & Related papers (2023-10-09T04:48:14Z) - Cross-lingual Lifelong Learning [53.06904052325966]
We present a principled Cross-lingual Continual Learning (CCL) evaluation paradigm.
We provide insights into what makes multilingual sequential learning particularly challenging.
The implications of this analysis include a recipe for how to measure and balance different cross-lingual continual learning desiderata.
arXiv Detail & Related papers (2022-05-23T09:25:43Z) - Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Learning Multilingual Representation for Natural Language Understanding with Enhanced Cross-Lingual Supervision [42.724921817550516]
We propose a network named decomposed attention (DA) as a replacement of MA.
The DA consists of an intra-lingual attention (IA) and a cross-lingual attention (CA), which model intralingual and cross-lingual supervisions respectively.
Experiments on various cross-lingual natural language understanding tasks show that the proposed architecture and learning strategy significantly improve the model's cross-lingual transferability.
arXiv Detail & Related papers (2021-06-09T16:12:13Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment [59.995385574274785]
We show that, contrary to previous belief, negative interference also impacts low-resource languages.
We present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference.
arXiv Detail & Related papers (2020-10-06T20:48:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.