UniMax: Fairer and More Effective Language Sampling for Large-Scale
Multilingual Pretraining
- URL: http://arxiv.org/abs/2304.09151v1
- Date: Tue, 18 Apr 2023 17:45:50 GMT
- Title: UniMax: Fairer and More Effective Language Sampling for Large-Scale
Multilingual Pretraining
- Authors: Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay,
Sharan Narang, Orhan Firat
- Abstract summary: We propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages.
We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases.
- Score: 92.3702056505905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained multilingual large language models have typically used heuristic
temperature-based sampling to balance between different languages. However,
previous work has not systematically evaluated the efficacy of different
pretraining language distributions across model scales. In this paper, we
propose a new sampling method, UniMax, that delivers more uniform coverage of
head languages while mitigating overfitting on tail languages by explicitly
capping the number of repeats over each language's corpus. We perform an
extensive series of ablations testing a range of sampling strategies on a suite
of multilingual benchmarks, while varying model scale. We find that UniMax
outperforms standard temperature-based sampling, and the benefits persist as
scale increases. As part of our contribution, we release: (i) an improved and
refreshed mC4 multilingual corpus consisting of 29 trillion characters across
107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained
with UniMax sampling.
Related papers
- Scaling Laws for Multilingual Language Models [41.6318470003173]
A primary challenge in studying multilingual scaling is the difficulty of analyzing individual language performance due to cross-lingual transfer.
We introduce and validate a hypothesis that the test cross-entropy loss for each language family is determined solely by its own sampling ratio.
We derive a power-law relationship that links performance with dataset size, model size, and sampling ratios (an illustrative form of such a relationship is sketched after this list).
arXiv Detail & Related papers (2024-10-15T20:29:38Z)
- EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models [50.459861376459656]
EMMA-500 is a large-scale multilingual language model continue-trained on texts across 546 languages.
Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity.
arXiv Detail & Related papers (2024-09-26T14:40:45Z)
- Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models [11.421452042888523]
We compare different calibration languages for pruning multilingual models across diverse languages, tasks, models, and SotA pruning techniques.
Our results offer practical suggestions; for example, calibrating in the target language can efficiently retain language modeling capability but does not necessarily benefit downstream tasks.
arXiv Detail & Related papers (2024-08-26T16:29:13Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z)
- Probing Multilingual Language Models for Discourse [0.0]
We find that the XLM-RoBERTa family of models consistently shows the best performance.
Our results also indicate that model distillation may hurt cross-lingual transfer of sentence representations.
arXiv Detail & Related papers (2021-06-09T06:34:21Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.), to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- Towards Zero-Shot Multilingual Synthetic Question and Answer Generation for Cross-Lingual Reading Comprehension [20.570539023748424]
We propose a simple method to generate multilingual question and answer pairs on a large scale.
These synthetic samples can be used to improve the zero-shot performance of multilingual QA models on target languages.
arXiv Detail & Related papers (2020-10-22T19:59:37Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
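
On the power-law relationship mentioned in the "Scaling Laws for Multilingual Language Models" entry above: the summary only names the variables involved (model size, dataset size, sampling ratio), so the display below is a generic Chinchilla-style form written in LaTeX as an assumption about how those variables could enter, not the exact parameterization derived in that paper.

    L_i(N, D, p_i) \;\approx\; E_i \;+\; \frac{A_i}{N^{\alpha}} \;+\; \frac{B_i}{(p_i D)^{\beta}}

Here L_i is the test cross-entropy loss of language family i, N the model size, D the total dataset size, p_i the family's sampling ratio, and E_i, A_i, B_i, alpha, beta are fitted constants; letting the data term depend on p_i D alone mirrors the hypothesis quoted above that each family's loss is determined by its own sampling ratio.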