FFR v1.1: Fon-French Neural Machine Translation
- URL: http://arxiv.org/abs/2006.09217v1
- Date: Sun, 14 Jun 2020 04:27:12 GMT
- Title: FFR v1.1: Fon-French Neural Machine Translation
- Authors: Bonaventure F. P. Dossou and Chris C. Emezue
- Abstract summary: The FFR project is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French.
In this paper, we introduce the FFR dataset, a corpus of Fon-to-French translations, describe the diacritical encoding process, and introduce our FFR v1.1 model.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: All over the world and especially in Africa, researchers are putting efforts
into building Neural Machine Translation (NMT) systems to help tackle the
language barriers in Africa, a continent of over 2000 different languages.
However, the low-resource nature and the diacritical and tonal complexities of African
languages are major issues being faced. The FFR project is a major step towards
creating a robust translation model from Fon, a very low-resource and tonal
language, to French, for research and public use. In this paper, we introduce
FFR Dataset, a corpus of Fon-to-French translations, describe the diacritical
encoding process, and introduce our FFR v1.1 model, trained on the dataset. The
dataset and model are made publicly available at https://github.com/bonaventuredossou/ffr-v1,
to promote collaboration and reproducibility.
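The abstract mentions a diacritical encoding process but does not specify it here. As a minimal sketch (assuming the goal is a consistent single representation for diacritized Fon characters such as ɛ́ or ɔ̀), Unicode NFC normalization collapses base-letter-plus-combining-mark sequences into precomposed code points where they exist:

```python
import unicodedata

def encode_diacritics(text: str) -> str:
    """Normalize to NFC so a base letter followed by a combining tone
    mark collapses into one precomposed code point where one exists,
    keeping the model vocabulary consistent across input encodings."""
    return unicodedata.normalize("NFC", text)

# 'e' + combining acute accent (U+0301) vs precomposed 'é' (U+00E9):
decomposed = "e\u0301"
composed = encode_diacritics(decomposed)
assert composed == "\u00e9"
assert len(decomposed) == 2 and len(composed) == 1
```

This is an illustrative preprocessing choice only; the paper's actual encoding scheme may differ.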
Related papers
- Marito: Structuring and Building Open Multilingual Terminologies for South African NLP
Lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP.
We introduce the foundational Marito dataset, released under the equitable, Africa-centered NOODL framework.
Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation.
arXiv Detail & Related papers (2025-08-05T15:00:02Z)
- Lugha-Llama: Adapting Large Language Models for African Languages
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications.
We consider how to adapt LLMs to low-resource African languages.
We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z)
- Ngambay-French Neural Machine Translation (sba-Fr)
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers.
In this project, we created the first sba-Fr dataset, which is a corpus of Ngambay-to-French translations.
Our experiments show that the M2M100 model outperforms other models with high BLEU scores on both original and original+synthetic data.
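BLEU, the metric behind the comparison above, is a geometric mean of modified n-gram precisions with a brevity penalty. A minimal single-pair sketch (illustrative only; real evaluations use a standard tool such as sacreBLEU):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU for one candidate/reference pair of token lists."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

assert bleu("the cat sat on the mat".split(),
            "the cat sat on the mat".split()) == 1.0
```

Scores reported in the papers listed here are corpus-level and tokenization-dependent, so they are not directly comparable across setups.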
arXiv Detail & Related papers (2023-08-25T17:13:20Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction
Most languages from the Americas are low-resourced, with a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions, the product of the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Low-Resourced Machine Translation for Senegalese Wolof Language
We present a parallel Wolof/French corpus of 123,000 sentences, on which we conducted experiments with machine translation models based on Recurrent Neural Networks (RNNs).
We noted performance gains with the models trained on subword-encoded data, as well as with those trained on the French-English language pair, compared to those trained on the French-Wolof pair under the same experimental conditions.
arXiv Detail & Related papers (2023-05-01T00:04:19Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- MMTAfrica: Multilingual Machine Translation for African Languages
We introduce MMTAfrica, the first many-to-many multilingual translation system for six African languages.
For multilingual translation concerning African languages, we introduce a novel backtranslation and reconstruction objective, BT&REC.
We report improvements from MMTAfrica over the FLORES 101 benchmarks.
arXiv Detail & Related papers (2022-04-08T21:42:44Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language
We introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training.
We compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.
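The WEB strategy itself relies on human-curated word expressions, which are not reproduced here. As an illustrative sketch only (the lexicon entry and the example words below are hypothetical, not drawn from the FFR data), a greedy longest-match tokenizer over such a phrase lexicon could look like:

```python
def phrase_tokenize(words, lexicon, max_len=4):
    """Greedy longest-match tokenization: merge word sequences found in a
    (hypothetical) human-curated phrase lexicon into single 'super-word'
    tokens, falling back to single words when no phrase matches."""
    tokens, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, down to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if n == 1 or phrase in lexicon:
                tokens.append(phrase)
                i += n
                break
    return tokens

lexicon = {"a ɖo"}  # hypothetical multi-word expression entry
assert phrase_tokenize("un a ɖo wema".split(), lexicon) == ["un", "a ɖo", "wema"]
```

The point of such super-word tokens is a vocabulary whose units align better with meaning-bearing expressions than character- or subword-level splits do for a tonal language with little data.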
arXiv Detail & Related papers (2021-03-14T22:12:14Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions in a diverse setting, including low-, medium-, and rich-resource pairs, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
- FFR V1.0: Fon-French Neural Machine Translation
Africa has the highest linguistic diversity in the world.
The low-resource, diacritical, and tonal complexities of African languages are major issues facing African NLP today.
This paper describes our pilot project: the creation of a large, growing corpus for Fon-to-French translations and our FFR v1.0 model, trained on this dataset.
arXiv Detail & Related papers (2020-03-26T19:01:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.