Low Resource Neural Machine Translation: A Benchmark for Five African
Languages
- URL: http://arxiv.org/abs/2003.14402v1
- Date: Tue, 31 Mar 2020 17:50:07 GMT
- Title: Low Resource Neural Machine Translation: A Benchmark for Five African
Languages
- Authors: Surafel M. Lakew, Matteo Negri, Marco Turchi
- Abstract summary: We benchmark NMT between English and five African LRL pairs (Swahili, Amharic, Tigrigna, Oromo, Somali)
We compare a baseline single language pair NMT model against semi-supervised learning, transfer learning, and multilingual modeling.
In terms of averaged BLEU score, the multilingual approach shows the largest gains, up to +5 points, in six out of ten translation directions.
- Score: 14.97774471012222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advents in Neural Machine Translation (NMT) have shown improvements in
low-resource language (LRL) translation tasks. In this work, we benchmark NMT
between English and five African LRL pairs (Swahili, Amharic, Tigrigna, Oromo,
Somali [SATOS]). We collected the available resources on the SATOS languages to
evaluate the current state of NMT for LRLs. Our evaluation, comparing a
baseline single language pair NMT model against semi-supervised learning,
transfer learning, and multilingual modeling, shows significant performance
improvements both in the En-LRL and LRL-En directions. In terms of averaged
BLEU score, the multilingual approach shows the largest gains, up to +5 points,
in six out of ten translation directions. To demonstrate the generalization
capability of each model, we also report results on multi-domain test sets. We
release the standardized experimental data and the test sets for future works
addressing the challenges of NMT in under-resourced settings, in particular for
the SATOS languages.
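The abstract does not spell out how the multilingual model is built or how BLEU is computed, so the sketch below is only an illustration under stated assumptions: a common recipe for a single many-direction model is to prepend a target-language tag to each source sentence, and corpus BLEU can be computed with sacrebleu and averaged over directions. The tag format, ISO codes, and the averaging scheme are assumptions, not details confirmed by the paper.

```python
# Minimal sketch (not the paper's exact pipeline): target-language tagging for one
# shared En<->SATOS model, plus corpus BLEU per direction averaged with sacrebleu.
# The "<2xx>" tag format, ISO codes, and averaging over directions are assumptions.
import sacrebleu

SATOS = ["sw", "am", "ti", "om", "so"]  # Swahili, Amharic, Tigrigna, Oromo, Somali


def tag_source(sentence: str, target_lang: str) -> str:
    """Prepend a target-language token so one model can serve all ten directions."""
    return f"<2{target_lang}> {sentence}"


def averaged_bleu(hyps_by_direction: dict, refs_by_direction: dict) -> float:
    """Corpus BLEU per translation direction, averaged over the directions."""
    scores = []
    for direction, hyps in hyps_by_direction.items():
        refs = [refs_by_direction[direction]]  # sacrebleu expects a list of reference streams
        scores.append(sacrebleu.corpus_bleu(hyps, refs).score)
    return sum(scores) / len(scores)


# Example: tag an English sentence for translation into Swahili.
print(tag_source("Good morning.", "sw"))  # -> "<2sw> Good morning."
```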
Related papers
- NusaMT-7B: Machine Translation for Low-Resource Indonesian Languages with Large Language Models [2.186901738997927]
This paper introduces NusaMT-7B, an LLM-based machine translation model for low-resource Indonesian languages.
Our approach integrates continued pre-training on monolingual data, Supervised Fine-Tuning (SFT), self-learning, and an LLM-based data cleaner to reduce noise in parallel sentences.
Our results show that fine-tuned LLMs can enhance translation quality for low-resource languages, aiding in linguistic preservation and cross-cultural communication.
arXiv Detail & Related papers (2024-10-10T11:33:25Z) - Optimizing the Training Schedule of Multilingual NMT using Reinforcement Learning [0.3277163122167433]
We propose two algorithms that use reinforcement learning to optimize the training schedule of multilingual NMT; a loose sketch of this batch-scheduling idea appears after this list.
On an 8-to-1 translation dataset with LRLs and HRLs, our second method improves BLEU and COMET scores relative to both random selection of monolingual batches and shuffled multilingual batches.
arXiv Detail & Related papers (2024-10-08T15:20:13Z) - Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation [62.202893186343935]
We explore what it would take to adapt Large Language Models for low-resource languages.
We show that parallel data is critical during both pre-training and Supervised Fine-Tuning (SFT).
Our experiments with three LLMs across two low-resourced language groups reveal consistent trends, underscoring the generalizability of our findings.
arXiv Detail & Related papers (2024-08-23T00:59:38Z) - Salute the Classic: Revisiting Challenges of Machine Translation in the
Age of Large Language Models [91.6543868677356]
The evolution of Neural Machine Translation has been influenced by six core challenges.
These challenges include domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search.
This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models.
arXiv Detail & Related papers (2024-01-16T13:30:09Z) - Zero-Shot Cross-Lingual Reranking with Large Language Models for
Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z) - Machine Translation for Ge'ez Language [0.0]
Machine translation for low-resource languages such as Ge'ez faces challenges such as out-of-vocabulary words, domain mismatches, and lack of labeled training data.
We develop a multilingual neural machine translation (MNMT) model based on language relatedness.
We also experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches.
arXiv Detail & Related papers (2023-11-24T14:55:23Z) - Low-Resource Machine Translation for Low-Resource Languages: Leveraging
Comparable Data, Code-Switching and Compute Resources [4.119597443825115]
We conduct an empirical study of unsupervised neural machine translation (NMT) for truly low resource languages.
We show how adding comparable data mined using a bilingual dictionary, along with modest additional compute resources to train the model, can significantly improve its performance.
Our work is the first to quantitatively show the impact of different modest compute resources in low-resource NMT.
arXiv Detail & Related papers (2021-03-24T15:40:28Z) - Self-Learning for Zero Shot Neural Machine Translation [13.551731309506874]
This work proposes a novel zero-shot NMT modeling approach that learns without the now-standard assumption of a pivot language sharing parallel data.
Compared to unsupervised NMT, consistent improvements are observed even in a domain-mismatch setting.
arXiv Detail & Related papers (2021-03-10T09:15:19Z) - Improving Target-side Lexical Transfer in Multilingual Neural Machine
Translation [104.10726545151043]
Multilingual data has been found to be more beneficial for NMT models that translate from an LRL to a target language than for those that translate into the LRL.
Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
arXiv Detail & Related papers (2020-10-04T19:42:40Z) - Leveraging Monolingual Data with Self-Supervision for Multilingual
Neural Machine Translation [54.52971020087777]
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models.
Self-supervision improves zero-shot translation quality in multilingual models.
We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
arXiv Detail & Related papers (2020-05-11T00:20:33Z) - Cross-lingual Supervision Improves Unsupervised Neural Machine
Translation [97.84871088440102]
We introduce a multilingual unsupervised NMT framework to leverage weakly supervised signals from high-resource language pairs to zero-resource translation directions.
The method significantly improves translation quality by more than 3 BLEU points on six benchmark unsupervised translation directions.
arXiv Detail & Related papers (2020-04-07T05:46:49Z)