Sheffield's Submission to the AmericasNLP Shared Task on Machine
Translation into Indigenous Languages
- URL: http://arxiv.org/abs/2306.09830v1
- Date: Fri, 16 Jun 2023 13:15:26 GMT
- Title: Sheffield's Submission to the AmericasNLP Shared Task on Machine
Translation into Indigenous Languages
- Authors: Edward Gow-Smith, Danae Sánchez Villegas
- Abstract summary: We describe the University of Sheffield's submission to the AmericasNLP 2023 Shared Task on Machine Translation into Indigenous languages.
Our approach consists of extending, training, and ensembling different variations of NLLB-200.
On the dev set, our best submission outperforms the baseline by 11% average chrF across all languages, with substantial improvements particularly for Aymara, Guarani and Quechua.
- Score: 4.251500966181852
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper we describe the University of Sheffield's submission to the
AmericasNLP 2023 Shared Task on Machine Translation into Indigenous Languages,
which comprises translation from Spanish into eleven Indigenous languages.
Our approach consists of extending, training, and ensembling different
variations of NLLB-200. We use data provided by the organizers and data from
various other sources such as constitutions, handbooks, news articles, and
backtranslations generated from monolingual data. On the dev set, our best
submission outperforms the baseline by 11% average chrF across all languages,
with substantial improvements particularly for Aymara, Guarani and Quechua. On
the test set, we achieve the highest average chrF of all the submissions, rank
first in four of the eleven languages, and have at least one submission ranked
in the top 3 for every language.
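The results above are reported in chrF, a character n-gram F-score. As a rough illustration of how such a score is formed, here is a minimal sketch, assuming whitespace is ignored, n-gram orders 1 to 6, and beta = 2 (recall weighted more heavily than precision). The function names `char_ngrams` and `simple_chrf` are hypothetical, and this is not the shared task's scoring script; official numbers come from an established implementation and may differ from this simplified version.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # One of the strings is too short to have n-grams of this order.
            continue
        # Clipped overlap: each n-gram counts at most as often as it occurs in both.
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    # Average precision and recall over the n-gram orders that were usable.
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    denominator = beta ** 2 * p + r
    if denominator == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / denominator


if __name__ == "__main__":
    # Hypothetical strings, chosen only to show partial character overlap.
    print(simple_chrf("the cat sat on a mat", "the cat sat on the mat"))
```

Because the score works on characters rather than whole words, a hypothesis that gets a word stem right but an affix wrong still earns partial credit, which is one reason character-level metrics are commonly used for morphologically rich languages.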
Related papers
- Aya Dataset: An Open-Access Collection for Multilingual Instruction
Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- Enhancing Translation for Indigenous Languages: Experiments with
Multilingual Models [57.10972566048735]
We present the system descriptions for three methods.
We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki-NLP Spanish-English translation model.
We experimented with 11 languages from America and report the setups we used as well as the results we achieved.
arXiv Detail & Related papers (2023-05-27T08:10:40Z)
- PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for
Languages in India [33.31556860332746]
PMIndiaSum is a multilingual and massively parallel summarization corpus focused on languages in India.
Our corpus provides a training and testing ground for four language families and 14 languages, and with 196 language pairs it is the largest to date.
arXiv Detail & Related papers (2023-05-15T17:41:15Z)
- SheffieldVeraAI at SemEval-2023 Task 3: Mono and multilingual approaches
for news genre, topic and persuasion technique classification [3.503844033591702]
This paper describes our approach for SemEval-2023 Task 3: Detecting the category, the framing, and the persuasion techniques in online news in a multi-lingual setup.
arXiv Detail & Related papers (2023-03-16T15:54:23Z)
- Enhancing Model Performance in Multilingual Information Retrieval with
Comprehensive Data Engineering Techniques [10.57012904999091]
We fine-tune pre-trained multilingual transformer-based models with the MIRACL dataset.
Our model improvement is mainly achieved through diverse data engineering techniques.
We secure 2nd place in the Surprise-Languages track with a score of 0.835 and 3rd place in the Known-Languages track with an average nDCG@10 score of 0.716 across the 16 known languages on the final leaderboard.
arXiv Detail & Related papers (2023-02-14T12:37:32Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum
of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Facebook AI WMT21 News Translation Task Submission [23.69817809546458]
We describe Facebook's multilingual model submission to the WMT2021 shared task on news translation.
We participate in 14 language directions: English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese.
We utilize data from all available sources to create high quality bilingual and multilingual baselines.
arXiv Detail & Related papers (2021-08-06T18:26:38Z)
- Facebook AI's WMT20 News Translation Task Submission [69.92594751788403]
This paper describes Facebook AI's submission to WMT20 shared news translation task.
We focus on the low resource setting and participate in two language pairs, Tamil -> English and Inuktitut -> English.
We approach the low resource problem using two main strategies, leveraging all available data and adapting the system to the target news domain.
arXiv Detail & Related papers (2020-11-16T21:49:00Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified, with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)