Few-shot Learning with Multilingual Language Models
- URL: http://arxiv.org/abs/2112.10668v1
- Date: Mon, 20 Dec 2021 16:52:35 GMT
- Title: Few-shot Learning with Multilingual Language Models
- Authors: Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui
Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du,
Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian
O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin
Stoyanov, Xian Li
- Abstract summary: We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
- Score: 66.49496434282564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale autoregressive language models such as GPT-3 are few-shot
learners that can perform a wide range of language tasks without fine-tuning.
While these models are known to be able to jointly represent many different
languages, their training data is dominated by English, potentially limiting
their cross-lingual generalization. In this work, we train multilingual
autoregressive language models on a balanced corpus covering a diverse set of
languages, and study their few- and zero-shot learning capabilities in a wide
range of tasks. Our largest model with 7.5 billion parameters sets new state of
the art in few-shot learning in more than 20 representative languages,
outperforming GPT-3 of comparable size in multilingual commonsense reasoning
(with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in
4-shot settings) and natural language inference (+5.4% in each of 0-shot and
4-shot settings). On the FLORES-101 machine translation benchmark, our model
outperforms GPT-3 on 171 out of 182 translation directions with 32 training
examples, while surpassing the official supervised baseline in 45 directions.
We present a detailed analysis of where the model succeeds and fails, showing
in particular that it enables cross-lingual in-context learning on some tasks,
while there is still room for improvement on surface form robustness and
adaptation to tasks that do not have a natural cloze form. Finally, we evaluate
our models on social value tasks such as hate speech detection in five
languages and find that they have limitations similar to those of comparably
sized GPT-3 models.
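Below is a minimal sketch of the kind of few-shot, cloze-style in-context evaluation described in the abstract, written against the Hugging Face transformers API. The checkpoint name facebook/xglm-564M (a small publicly released multilingual checkpoint) and the German sentiment prompt template are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: few-shot, cloze-style in-context classification with a
# multilingual autoregressive LM. Candidate labels are scored by the
# log-probability the model assigns to them as continuations of the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "facebook/xglm-564M"  # assumed public checkpoint; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# k-shot prompt: a few labeled German demonstrations, then the test input.
demos = [
    ("Das Essen war hervorragend.", "positiv"),
    ("Der Service war sehr langsam.", "negativ"),
]
test_input = "Die Aussicht vom Zimmer war wunderschön."
prompt = "".join(f"Bewertung: {x} Stimmung: {y}\n" for x, y in demos)
prompt += f"Bewertung: {test_input} Stimmung:"

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities of the completion tokens given the prompt.

    Assumes the prompt tokenization is a prefix of the full tokenization,
    which is good enough for a sketch.
    """
    ctx_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(prompt + " " + completion, return_tensors="pt")
    with torch.no_grad():
        logits = model(**full).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full.input_ids[0, 1:]
    # Score only the tokens that belong to the completion.
    return log_probs[ctx_len - 1:].gather(1, targets[ctx_len - 1:, None]).sum().item()

scores = {label: completion_logprob(prompt, label) for label in ("positiv", "negativ")}
print(max(scores, key=scores.get))  # predicted label for the test sentence
```

For generation settings such as the FLORES-101 translation evaluation, the same prompt-with-demonstrations idea applies, but one would call model.generate on the prompt instead of scoring a fixed set of candidate completions.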
Related papers
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual
Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z) - Sabiá: Portuguese Large Language Models [14.801853435122908]
We show that monolingual pretraining on the target language significantly improves models already extensively trained on diverse corpora.
Few-shot evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models outperform English-centric and multilingual counterparts by a significant margin.
arXiv Detail & Related papers (2023-04-16T20:11:19Z) - mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages.
We reproduce the GPT-3 architecture using the GPT-2 source code and a sparse attention mechanism.
The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z) - Language Models are Few-shot Multilingual Learners [66.11011385895195]
We evaluate the multilingual skills of the GPT and T5 models in conducting multi-class classification on non-English languages.
We show that, given a few English examples as context, pre-trained language models can predict not only English test samples but also non-English ones.
arXiv Detail & Related papers (2021-09-16T03:08:22Z) - Finetuned Language Models Are Zero-Shot Learners [67.70352207685558]
We show that instruction tuning boosts zero-shot performance on unseen tasks.
We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates.
We evaluate this instruction-tuned model, which we call FLAN, on unseen task types.
arXiv Detail & Related papers (2021-09-03T17:55:52Z) - On the Multilingual Capabilities of Very Large-Scale English Language
Models [0.0]
Generative Pre-trained Transformers (GPTs) have recently been scaled to unprecedented sizes in the history of machine learning.
In this work, we investigate the multilingual skills of GPT-3, focusing on one language that barely appears in the pre-training corpus, Catalan.
We find that the model shows outstanding performance, particularly on generative tasks, with predictable limitations mostly on language understanding tasks, but still delivers remarkable results given the zero-shot scenario.
arXiv Detail & Related papers (2021-08-30T16:18:50Z) - ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language
Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network with an auto-encoding network, so that the trained model can be easily tailored to both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z) - Probing Multilingual Language Models for Discourse [0.0]
We find that the XLM-RoBERTa family of models consistently shows the best performance.
Our results also indicate that model distillation may hurt the cross-lingual transfer ability of sentence representations.
arXiv Detail & Related papers (2021-06-09T06:34:21Z) - AmericasNLI: Evaluating Zero-shot Natural Language Understanding of
Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 Indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
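As a companion illustration of the zero-shot cross-lingual transfer setup evaluated in AmericasNLI above, the sketch below runs NLI inference in a non-English language with an XLM-R model fine-tuned on English/XNLI data. The checkpoint name joeddav/xlm-roberta-large-xnli is an assumption about a publicly available XNLI-finetuned model, not a checkpoint used in these papers.

```python
# Minimal sketch: zero-shot cross-lingual NLI with an XNLI-finetuned XLM-R.
# The model is fine-tuned on NLI data (largely English), then applied
# directly to premise/hypothesis pairs in another language.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "joeddav/xlm-roberta-large-xnli"  # assumed public XNLI-finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()

premise = "El hombre está tocando la guitarra en el escenario."  # Spanish premise
hypothesis = "Alguien está haciendo música."                     # Spanish hypothesis

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# The label order comes from the checkpoint's config, so read it from there.
for i, p in enumerate(probs.tolist()):
    print(f"{model.config.id2label[i]}: {p:.3f}")
```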
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.