The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining
- URL: http://arxiv.org/abs/2601.00364v1
- Date: Thu, 01 Jan 2026 14:52:06 GMT
- Title: The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining
- Authors: Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp, Jianfei Yang, Yao Lu
- Abstract summary: We compare the standard web corpus with a monolingual-only version that removes all multilingual documents. We categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on semantic relevance. Our experiments reveal that parallel data almost fully restores translation performance, whereas code-switching contributes minimally.
- Score: 29.376308590290297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Although bilingual data constitutes only 2% of the corpus, removing it causes translation performance to drop by 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in different languages. We then conduct granular ablations by reintroducing parallel or code-switching data into the monolingual-only corpus. Our experiments reveal that parallel data almost fully restores translation performance (91% of the unfiltered baseline), whereas code-switching contributes minimally. Other cross-lingual tasks remain largely unaffected by either type. These findings reveal that translation critically depends on systematic token-level alignments from parallel data, whereas cross-lingual understanding and reasoning appear to be achievable even without bilingual data.
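The abstract names the three bilingual categories and the criterion used to assign them (semantic relevance of content across languages), but not the concrete pipeline. The sketch below is a hypothetical illustration of such a categorizer: `detect_language`, `token_languages`, `embed`, and both thresholds are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of categorizing a mixed-language web document as
# parallel, code-switching, or miscellaneous. The helpers below stand in for
# any language-ID tool and any multilingual sentence encoder (LaBSE-style);
# the thresholds and decision rule are guesses, not the paper's pipeline.

from collections import Counter
import numpy as np

PARALLEL_SIM = 0.8    # assumed: above this, cross-lingual segments are near-translations
MIXING_RATIO = 0.3    # assumed: fraction of segments that mix languages internally

def detect_language(text: str) -> str:
    """Stand-in for a segment-level language-identification model (ISO code)."""
    raise NotImplementedError

def token_languages(text: str) -> set[str]:
    """Stand-in for token-level language identification within one segment."""
    raise NotImplementedError

def embed(segments: list[str]) -> np.ndarray:
    """Stand-in for a multilingual sentence encoder; returns L2-normalised rows."""
    raise NotImplementedError

def categorize(segments: list[str]) -> str:
    """Label one document (already known to contain at least two languages)."""
    langs = [detect_language(s) for s in segments]
    (lang_a, _), (lang_b, _) = Counter(langs).most_common(2)

    # Parallel: content in one language is semantically mirrored in the other.
    emb_a = embed([s for s, l in zip(segments, langs) if l == lang_a])
    emb_b = embed([s for s, l in zip(segments, langs) if l == lang_b])
    best_match = (emb_a @ emb_b.T).max(axis=1)   # cosine of best counterpart per segment
    if float(best_match.mean()) >= PARALLEL_SIM:
        return "parallel"

    # Code-switching: languages alternate inside segments rather than translating each other.
    mixed = sum(len(token_languages(s)) >= 2 for s in segments)
    if mixed / len(segments) >= MIXING_RATIO:
        return "code-switching"

    return "miscellaneous"
```

Under this reading, a document whose segments in one language closely mirror segments in the other counts as parallel, while a document that merely interleaves languages inside segments counts as code-switching; everything else is miscellaneous.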
Related papers
- Revisiting Multilingual Data Mixtures in Language Model Pretraining [20.282622416939997]
We study the impact of different multilingual data mixtures in pretraining large language models. We find that combining English and multilingual data does not necessarily degrade the in-language performance of either group. We do not observe a significant "curse of multilinguality" as the number of training languages increases.
arXiv Detail & Related papers (2025-10-29T20:46:03Z) - Assessing the Role of Data Quality in Training Bilingual Language Models [17.603371705571107]
We show that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy that uses only high-quality English data to select higher-quality bilingual training data.
arXiv Detail & Related papers (2025-06-15T21:08:51Z) - Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval [5.446052898856584]
This paper proposes a novel hybrid batch training strategy to improve zero-shot retrieval performance across monolingual, cross-lingual, and multilingual settings.
The approach fine-tunes multilingual language models using a mix of monolingual and cross-lingual question-answer pair batches sampled based on dataset size.
arXiv Detail & Related papers (2024-08-20T04:30:26Z) - Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment [50.80949663719335]
Multilingual sentence encoders (MSEs) are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space. MSEs are subject to the curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing. We train the cross-lingual adapters with two different types of data to resolve the conflicting requirements of different cross-lingual tasks.
arXiv Detail & Related papers (2024-07-20T13:56:39Z) - Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, i.e., be crosslingual? This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment [42.624862172666624]
We propose a simple yet effective cross-lingual alignment framework exploiting pairs of translation sentences.
It aligns the internal sentence representations across different languages via multilingual contrastive learning.
Experimental results show that even with less than 0.1‰ (per mille) of the pre-training tokens, our alignment framework significantly boosts the cross-lingual abilities of generative language models (a minimal contrastive-alignment sketch appears after this list).
arXiv Detail & Related papers (2023-11-14T11:24:08Z) - Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretraining and finetuning stages (a sketch of constructing such restore examples appears after this list).
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z) - Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids the degenerate behaviour of predicting masked words conditioned only on context from the same language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - What makes multilingual BERT multilingual? [60.9051207862378]
In this work, we provide an in-depth experimental study to supplement the existing literature on cross-lingual ability.
We compare the cross-lingual ability of non-contextualized and contextualized representation models trained on the same data.
We find that data size and context window size are crucial factors for transferability.
arXiv Detail & Related papers (2020-10-20T05:41:56Z) - On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
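For the cross-lingual alignment framework summarized above ("Improving In-context Learning of Multilingual Generative Language Models with Cross-lingual Alignment"), a minimal PyTorch sketch of multilingual contrastive learning over translation pairs follows. It is illustrative only: the mean-pooled sentence representation, the temperature, and the Hugging Face-style `output_hidden_states` interface are assumptions, not the authors' code.

```python
# Illustrative InfoNCE-style alignment over translation pairs; not the authors' code.

import torch
import torch.nn.functional as F

def sentence_reps(model, input_ids, attention_mask):
    """Assumed pooling: mask-aware mean over the final hidden states of the LM."""
    hidden = model(input_ids, attention_mask=attention_mask,
                   output_hidden_states=True).hidden_states[-1]
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

def alignment_loss(model, src_batch, tgt_batch, temperature=0.05):
    """Contrastive loss pulling translation pairs together.

    src_batch[i] and tgt_batch[i] are assumed to be a translation pair
    (dicts of input_ids / attention_mask); the other sentences in the batch
    act as in-batch negatives.
    """
    src = F.normalize(sentence_reps(model, **src_batch), dim=-1)
    tgt = F.normalize(sentence_reps(model, **tgt_batch), dim=-1)

    logits = src @ tgt.t() / temperature              # (batch, batch) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric objective: source-to-target and target-to-source directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```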
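Similarly, for the code-switching restore task in "Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation", here is a hypothetical sketch of how restore training pairs could be constructed; the bilingual lexicon, replacement ratio, and whitespace tokenization are illustrative assumptions rather than the paper's actual procedure.

```python
# Hypothetical construction of (corrupted, original) pairs for a restore task.

import random

def make_restore_example(sentence, lexicon, switch_ratio=0.15, seed=None):
    """Return (corrupted, original); a seq2seq model learns to restore the original.

    `lexicon` maps a source-language word to candidate foreign translations;
    `switch_ratio` is the probability of swapping an eligible word.
    """
    rng = random.Random(seed)
    corrupted = []
    for tok in sentence.split():
        if tok.lower() in lexicon and rng.random() < switch_ratio:
            corrupted.append(rng.choice(lexicon[tok.lower()]))   # inject a foreign word
        else:
            corrupted.append(tok)
    return " ".join(corrupted), sentence

# Toy usage with an invented English-German lexicon:
lexicon = {"house": ["Haus"], "water": ["Wasser"], "good": ["gut"]}
noisy, clean = make_restore_example("the house has good water", lexicon,
                                    switch_ratio=0.5, seed=0)
# The restore task then trains the pretrained seq2seq model to map `noisy` back to `clean`.
```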