UniMax: Fairer and more Effective Language Sampling for Large-Scale
Multilingual Pretraining
- URL: http://arxiv.org/abs/2304.09151v1
- Date: Tue, 18 Apr 2023 17:45:50 GMT
- Title: UniMax: Fairer and more Effective Language Sampling for Large-Scale
Multilingual Pretraining
- Authors: Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay,
Sharan Narang, Orhan Firat
- Abstract summary: We propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages.
We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases.
- Score: 92.3702056505905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained multilingual large language models have typically used heuristic
temperature-based sampling to balance between different languages. However
previous work has not systematically evaluated the efficacy of different
pretraining language distributions across model scales. In this paper, we
propose a new sampling method, UniMax, that delivers more uniform coverage of
head languages while mitigating overfitting on tail languages by explicitly
capping the number of repeats over each language's corpus. We perform an
extensive series of ablations testing a range of sampling strategies on a suite
of multilingual benchmarks, while varying model scale. We find that UniMax
outperforms standard temperature-based sampling, and the benefits persist as
scale increases. As part of our contribution, we release: (i) an improved and
refreshed mC4 multilingual corpus consisting of 29 trillion characters across
107 languages, and (ii) a suite of pretrained umT5 model checkpoints trained
with UniMax sampling.
Related papers
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of
Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z) - Pre-training Data Quality and Quantity for a Low-Resource Language: New
Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu)
arXiv Detail & Related papers (2022-05-21T06:44:59Z) - Multilingual unsupervised sequence segmentation transfers to extremely
low-resource languages [0.0]
Unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model multilingually.
We show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language.
arXiv Detail & Related papers (2021-10-16T00:08:28Z) - Probing Multilingual Language Models for Discourse [0.0]
We find that the XLM-RoBERTa family of models consistently show the best performance.
Our results also indicate that model distillation may hurt the ability of cross-lingual transfer of sentence representations.
arXiv Detail & Related papers (2021-06-09T06:34:21Z) - AmericasNLI: Evaluating Zero-shot Natural Language Understanding of
Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.), to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z) - Towards Zero-Shot Multilingual Synthetic Question and Answer Generation
for Cross-Lingual Reading Comprehension [20.570539023748424]
We propose a simple method to generate multilingual question and answer pairs on a large scale.
These synthetic samples can be used to improve the zero-shot performance of multilingual QA models on target languages.
arXiv Detail & Related papers (2020-10-22T19:59:37Z) - Multilingual Translation with Extensible Multilingual Pretraining and
Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z) - XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating
Cross-lingual Generalization [128.37244072182506]
Cross-lingual TRansfer Evaluation of Multilinguals XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.