No Language Left Behind: Scaling Human-Centered Machine Translation
- URL: http://arxiv.org/abs/2207.04672v1
- Date: Mon, 11 Jul 2022 07:33:36 GMT
- Title: No Language Left Behind: Scaling Human-Centered Machine Translation
- Authors: NLLB team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha
Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam,
Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al
Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip
Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe,
Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale,
Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán,
Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem,
Holger Schwenk, Jeff Wang (NLLB Team)
- Abstract summary: We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
- Score: 69.28110770760506
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Driven by the goal of eradicating language barriers on a global scale,
machine translation has solidified itself as a key focus of artificial
intelligence research today. However, such efforts have coalesced around a
small subset of languages, leaving behind the vast majority of mostly
low-resource languages. What does it take to break the 200 language barrier
while ensuring safe, high quality results, all while keeping ethical
considerations in mind? In No Language Left Behind, we took on this challenge
by first contextualizing the need for low-resource language translation support
through exploratory interviews with native speakers. Then, we created datasets
and models aimed at narrowing the performance gap between low and high-resource
languages. More specifically, we developed a conditional compute model based on
Sparsely Gated Mixture of Experts that is trained on data obtained with novel
and effective data mining techniques tailored for low-resource languages. We
propose multiple architectural and training improvements to counteract
overfitting while training on thousands of tasks. Critically, we evaluated the
performance of over 40,000 different translation directions using a
human-translated benchmark, Flores-200, and combined human evaluation with a
novel toxicity benchmark covering all languages in Flores-200 to assess
translation safety. Our model achieves an improvement of 44% BLEU relative to
the previous state-of-the-art, laying important groundwork towards realizing a
universal translation system. Finally, we open source all contributions
described in this work, accessible at
https://github.com/facebookresearch/fairseq/tree/nllb.
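To make the conditional-compute idea above concrete, the sketch below shows a minimal sparsely gated Mixture-of-Experts feed-forward layer with top-2 routing in PyTorch. It is an illustrative simplification, not the NLLB-200 architecture: a production model additionally needs load-balancing losses, capacity limits, and expert parallelism, and all dimensions and names here are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Illustrative sparsely gated Mixture-of-Experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.size(-1))              # flatten (batch, seq) into tokens
        logits = self.gate(tokens)                      # (n_tokens, n_experts)
        weights, indices = logits.topk(self.k, dim=-1)  # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = indices[:, slot] == e            # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(tokens[mask])
        return out.reshape_as(x)

# Each token activates only 2 of the 8 expert FFNs, so model capacity grows with
# the number of experts while per-token compute stays roughly constant.
layer = TopKMoEFFN(d_model=512, d_ff=2048)
print(layer(torch.randn(4, 10, 512)).shape)  # torch.Size([4, 10, 512])
```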
Related papers
- Low-Resource Machine Translation through the Lens of Personalized Federated Learning [26.436144338377755]
We present a new approach, MeritFed, that can be applied to natural language tasks with heterogeneous data.
We evaluate it on the Low-Resource Machine Translation task, using the dataset from the Large-Scale Multilingual Machine Translation Shared Task.
In addition to being effective, MeritFed is highly interpretable, as it can be used to track the impact of each language used for training.
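The summary gives no algorithmic detail, so the following is only a generic sketch of the underlying idea: aggregate per-language updates with tunable weights and inspect those weights to gauge each language's contribution. It is not MeritFed itself; all names and values are hypothetical.

```python
from typing import Dict
import numpy as np

def aggregate_updates(updates: Dict[str, np.ndarray], weights: Dict[str, float]) -> np.ndarray:
    """FedAvg-style weighted average of per-language parameter updates."""
    total = sum(weights.values())
    return sum(w * updates[lang] for lang, w in weights.items()) / total

# Hypothetical per-language updates for a tiny 3-parameter model.
updates = {
    "target_lang": np.array([0.30, -0.20, 0.10]),
    "aux_lang_1": np.array([0.20, -0.10, 0.05]),
    "aux_lang_2": np.array([0.10, -0.10, 0.00]),
}
# Weights would be tuned, e.g. against target-language validation loss; reading
# them off afterwards indicates how much each auxiliary language contributed.
weights = {"target_lang": 0.5, "aux_lang_1": 0.3, "aux_lang_2": 0.2}
print(aggregate_updates(updates, weights))
```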
arXiv Detail & Related papers (2024-06-18T12:50:00Z)
- Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study [1.6819960041696331]
In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian.
Our experiments apply back-translation and transfer learning to automatically generate additional training data and achieve higher translation performance.
Statistical significance tests with Bonferroni correction show that the baseline systems perform surprisingly well and that back-translation leads to significant improvements.
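Back-translation creates synthetic parallel data by translating target-side monolingual text back into the source language with a reverse-direction model. A minimal sketch, assuming a hypothetical reverse_translate helper rather than the paper's actual models:

```python
from typing import Callable, List, Tuple

def back_translate(
    mono_target: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Turn target-side monolingual sentences into synthetic (source, target) pairs."""
    return [(reverse_translate(sentence), sentence) for sentence in mono_target]

# Hypothetical stand-in for a Bavarian -> German reverse model.
def dummy_bar_to_de(sentence: str) -> str:
    return "<synthetic German for: " + sentence + ">"

# The synthetic pairs are mixed with the genuine parallel data and the forward
# (German -> Bavarian) system is retrained on the union.
for src, tgt in back_translate(["I mog di.", "Servus beinand!"], dummy_bar_to_de):
    print(src, "=>", tgt)
```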
arXiv Detail & Related papers (2024-04-12T06:16:26Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
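Lexical diversity can be quantified in several ways; the exact metric used by the paper is not stated in this summary, but a simple type-token ratio illustrates the kind of comparison:

```python
def type_token_ratio(text: str) -> float:
    """Unique word types divided by total word tokens (higher = more lexically diverse)."""
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

# Toy illustration: repetitive scraped text vs. a natively written sentence.
scraped = "the market is open the market is busy the market is open"
written = "pasar pagi itu ramai sekali dengan pedagang sayur dan buah segar"
print(round(type_token_ratio(scraped), 2))  # low diversity
print(round(type_token_ratio(written), 2))  # high diversity
```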
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation [21.057178077747754]
In this work, we propose OPTICAL: Optimal Transport distillation for low-resource Cross-lingual information retrieval.
By separating cross-lingual knowledge from the knowledge of query-document matching, OPTICAL only needs bitext data for distillation training.
Experimental results show that, with minimal training data, OPTICAL significantly outperforms strong baselines on low-resource languages.
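Optimal transport computes a soft matching between two sets of embeddings under a cost matrix; the entropically regularized version is solved with Sinkhorn iterations. The sketch below is a generic Sinkhorn routine over placeholder embeddings, not OPTICAL's actual distillation objective:

```python
import numpy as np

def sinkhorn(cost: np.ndarray, eps: float = 0.1, n_iters: int = 200) -> np.ndarray:
    """Entropic-regularized optimal transport plan between uniform marginals."""
    n, m = cost.shape
    K = np.exp(-cost / eps)          # Gibbs kernel
    r = np.full(n, 1.0 / n)          # uniform source marginal
    c = np.full(m, 1.0 / m)          # uniform target marginal
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan diag(u) K diag(v)

# Toy cost: 1 - cosine similarity between query-token and document-token
# embeddings (random placeholders standing in for real encoder outputs).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)); d = rng.normal(size=(6, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d /= np.linalg.norm(d, axis=1, keepdims=True)
plan = sinkhorn(1.0 - q @ d.T)
print(plan.shape, round(plan.sum(), 3))  # (4, 6) 1.0
```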
arXiv Detail & Related papers (2023-01-29T22:30:36Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages [1.7625363344837164]
We introduce the first cross-lingual information extraction pipeline for Sumerian.
We also curate InterpretLR, an interpretability toolkit for low-resource NLP.
Most components of our pipeline can be generalised to any other language while remaining interpretable.
arXiv Detail & Related papers (2021-05-30T12:09:59Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance through self-training.
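Self-training here means pseudo-labelling: the current model answers unlabeled questions, and only its high-confidence predictions are kept as extra training data. A schematic sketch with a placeholder predict function, not the paper's exact recipe:

```python
from typing import Callable, List, Tuple

def self_training_round(
    labeled: List[Tuple[str, str]],
    unlabeled: List[str],
    predict: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
    threshold: float = 0.9,
) -> List[Tuple[str, str]]:
    """One pseudo-labelling round: keep confident predictions as new training pairs."""
    pseudo = []
    for question in unlabeled:
        answer, confidence = predict(question)
        if confidence >= threshold:
            pseudo.append((question, answer))
    # The model is then retrained on labeled + pseudo and the loop repeated.
    return labeled + pseudo

# Hypothetical stand-in for a multilingual reading-comprehension model.
def toy_predict(question: str) -> Tuple[str, float]:
    return "some answer span", 0.95

print(len(self_training_round([("q1", "a1")], ["q2", "q3"], toy_predict)))  # 3
```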
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
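Distilling several language-branch teachers into one student can be done by matching the student's predictions to the teachers' averaged soft targets. A minimal PyTorch sketch of such a loss, as a generic formulation rather than LBMRC's exact objective:

```python
from typing import List
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(
    student_logits: torch.Tensor,          # (batch, n_classes)
    teacher_logits: List[torch.Tensor],    # one tensor per language-branch teacher
    temperature: float = 2.0,
) -> torch.Tensor:
    """KL divergence between the student and the teachers' averaged soft targets."""
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy usage: two teachers, a batch of 3 examples, 4 answer classes.
student = torch.randn(3, 4)
teachers = [torch.randn(3, 4), torch.randn(3, 4)]
print(multi_teacher_kd_loss(student, teachers))
```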
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while remaining competitive with the best single systems of WMT.
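Large-scale bitext mining of the kind mentioned above typically scores candidate sentence pairs with a margin criterion: cosine similarity normalized by each sentence's average similarity to its nearest neighbours. The sketch below is a generic ratio-margin scorer over placeholder embeddings, not the paper's exact mining pipeline:

```python
import numpy as np

def margin_scores(src_emb: np.ndarray, tgt_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin score for every source/target pair: cos(x, y) / ((knn_x + knn_y) / 2)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                    # cosine similarities (n_src, n_tgt)
    src_knn = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # avg sim of each source to its k-NN
    tgt_knn = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # avg sim of each target to its k-NN
    return sim / ((src_knn[:, None] + tgt_knn[None, :]) / 2)

# Toy usage with random vectors standing in for multilingual sentence embeddings.
rng = np.random.default_rng(0)
scores = margin_scores(rng.normal(size=(5, 16)), rng.normal(size=(7, 16)))
print(scores.argmax(axis=1))  # best-scoring target candidate for each source sentence
```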
arXiv Detail & Related papers (2020-10-21T17:01:23Z)