Participatory Research for Low-resourced Machine Translation: A Case
Study in African Languages
- URL: http://arxiv.org/abs/2010.02353v2
- Date: Fri, 6 Nov 2020 23:30:45 GMT
- Title: Participatory Research for Low-resourced Machine Translation: A Case
Study in African Languages
- Authors: Wilhelmina Nekoto, Vukosi Marivate, Tshinondiwa Matsila, Timi Fasubaa,
Tajudeen Kolawole, Taiwo Fagbohungbe, Solomon Oluwole Akinola, Shamsuddeen
Hassan Muhammad, Salomon Kabongo, Salomey Osei, Sackey Freshia, Rubungo Andre
Niyongabo, Ricky Macharm, Perez Ogayo, Orevaoghene Ahia, Musie Meressa, Mofe
Adeyemi, Masabata Mokgesi-Selinga, Lawrence Okegbemi, Laura Jane Martinus,
Kolawole Tajudeen, Kevin Degila, Kelechi Ogueji, Kathleen Siminyu, Julia
Kreutzer, Jason Webster, Jamiil Toure Ali, Jade Abbott, Iroro Orife, Ignatius
Ezeani, Idris Abdulkabir Dangana, Herman Kamper, Hady Elsahar, Goodness Duru,
Ghollah Kioko, Espoir Murhabazi, Elan van Biljon, Daniel Whitenack,
Christopher Onyefuluchi, Chris Emezue, Bonaventure Dossou, Blessing Sibanda,
Blessing Itoro Bassey, Ayodele Olabiyi, Arshath Ramkilowan, Alp Öktem,
Adewale Akinfaderin, Abdallah Bashir
- Abstract summary: "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society.
We propose participatory research as a means to involve all necessary agents required in the Machine Translation development process.
- Score: 15.859824747983556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research in NLP lacks geographic diversity, and the question of how NLP can
be scaled to low-resourced languages has not yet been adequately solved.
"Low-resourced"-ness is a complex problem going beyond data availability and
reflects systemic problems in society. In this paper, we focus on the task of
Machine Translation (MT), which plays a crucial role in information
accessibility and communication worldwide. Despite immense improvements in MT
over the past decade, MT is centered around a few high-resourced languages. As
MT researchers cannot solve the problem of low-resourcedness alone, we propose
participatory research as a means to involve all necessary agents required in
the MT development process. We demonstrate the feasibility and scalability of
participatory research with a case study on MT for African languages. Its
implementation leads to a collection of novel translation datasets and MT
benchmarks for over 30 languages, with human evaluations for a third of them,
and it enables participants without formal training to make a unique
scientific contribution. Benchmarks, models, data, code, and evaluation
results are released at https://github.com/masakhane-io/masakhane-mt.
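Since the released benchmarks pair system outputs with reference translations,
scoring a model against them reduces to a corpus-level metric computation. The
sketch below shows one way this could be done with the sacrebleu library; it is
a minimal illustration, not the repository's own evaluation script, and the
file names are placeholders.

    # Illustrative evaluation sketch (not the masakhane-mt repo's script).
    # File paths are placeholders. Requires: pip install sacrebleu
    import sacrebleu

    def score_test_set(hyp_path: str, ref_path: str) -> float:
        """Corpus BLEU for one language pair's line-aligned test files."""
        with open(hyp_path, encoding="utf-8") as f:
            hypotheses = [line.strip() for line in f]
        with open(ref_path, encoding="utf-8") as f:
            references = [line.strip() for line in f]
        assert len(hypotheses) == len(references), "files must align line-by-line"
        # sacrebleu expects a list of reference streams; one stream here.
        return sacrebleu.corpus_bleu(hypotheses, [references]).score

    print(score_test_set("hypotheses.txt", "references.txt"))  # placeholder files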
Related papers
- Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu [53.437954702561065]
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT.
This study systematically investigates how each resource and its quality affect translation performance, using Manchu as a case study.
Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help (see the prompt sketch after this list).
arXiv Detail & Related papers (2025-02-17T14:53:49Z) - From Priest to Doctor: Domain Adaptation for Low-Resource Neural Machine Translation [3.666125285899499]
Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models.
arXiv Detail & Related papers (2024-12-01T21:06:08Z) - Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service [7.299910666525873]
We propose an observational study on actual usage patterns of a specialized MT service for the Tetun language in Timor-Leste.
Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora.
Our results suggest that MT systems for minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts.
arXiv Detail & Related papers (2024-11-19T06:21:51Z) - Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation [62.202893186343935]
We explore what it would take to adapt Large Language Models for low-resource languages.
We show that parallel data is critical during both pre-training and Supervised Fine-Tuning (SFT).
Our experiments with three LLMs across two low-resourced language groups reveal consistent trends, underscoring the generalizability of our findings.
arXiv Detail & Related papers (2024-08-23T00:59:38Z) - EthioMT: Parallel Corpus for Low-resource Ethiopian Languages [49.80726355048843]
We introduce EthioMT -- a new parallel corpus for 15 languages.
We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia.
We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
arXiv Detail & Related papers (2024-03-28T12:26:45Z) - Replicable Benchmarking of Neural Machine Translation (NMT) on
Low-Resource Local Languages in Indonesia [4.634142034755327]
This study comprehensively analyzes training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese.
Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performance.
arXiv Detail & Related papers (2023-11-02T05:27:48Z) - ChatGPT MT: Competitive for High- (but not Low-) Resource Languages [62.178282377729566]
Large language models (LLMs) implicitly learn to perform a range of language tasks, including machine translation (MT).
We present the first experimental evidence for an expansive set of 204 languages, along with MT cost analysis.
Our analysis reveals that a language's resource level is the most important feature in determining ChatGPT's relative ability to translate it.
arXiv Detail & Related papers (2023-09-14T04:36:00Z) - Dictionary-based Phrase-level Prompting of Large Language Models for
Machine Translation [91.57514888410205]
Large language models (LLMs) demonstrate remarkable machine translation (MT) abilities via prompting.
LLMs can struggle to translate inputs with rare words, which are common in low resource or domain transfer scenarios.
We show that LLM prompting can provide an effective solution for rare words as well, by using prior knowledge from bilingual dictionaries to provide control hints in the prompts, as sketched after this list.
arXiv Detail & Related papers (2023-02-15T18:46:42Z) - Towards Better Chinese-centric Neural Machine Translation for
Low-resource Languages [12.374365655284342]
Building neural machine translation (NMT) systems has become a pressing need, especially in the low-resource setting.
Recent work tends to study NMT systems for low-resource languages centered on English, while few works focus on low-resource NMT systems centered on other languages such as Chinese.
We present the winning competition system, which leverages monolingual word-embedding data enhancement, bilingual curriculum learning, and contrastive re-ranking.
arXiv Detail & Related papers (2022-04-09T01:05:37Z) - A Survey on Low-Resource Neural Machine Translation [106.51056217748388]
We classify related works into three categories according to the auxiliary data they used.
We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms.
arXiv Detail & Related papers (2021-07-09T06:26:38Z) - On the Integration of Linguistic Features into Statistical and Neural
Machine Translation [2.132096006921048]
We investigate the discrepancies between the strengths of statistical approaches to machine translation and the way humans translate.
We identify linguistic information that is lacking in order for automatic translation systems to produce more accurate translations.
We identify overgeneralization or 'algorithmic bias' as a potential drawback of neural MT and link it to many of the remaining linguistic issues.
arXiv Detail & Related papers (2020-03-31T16:03:38Z)
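Two of the entries above, the Manchu in-context MT study and the
dictionary-based prompting paper, rely on the same basic mechanic: packing
parallel demonstrations and bilingual-dictionary hints into an LLM prompt. The
sketch below illustrates that prompt construction under stated assumptions;
the template, the toy Swahili dictionary, and the example pair are invented
for illustration and are not the papers' exact formats.

    # Sketch of in-context MT prompting with parallel examples and
    # dictionary hints. The template, dictionary entries, and example
    # pairs below are illustrative assumptions, not the papers' formats.
    from typing import Dict, List, Tuple

    def build_mt_prompt(source: str,
                        examples: List[Tuple[str, str]],
                        dictionary: Dict[str, str],
                        src_lang: str = "English",
                        tgt_lang: str = "Swahili") -> str:
        """Compose a translation prompt from demonstrations and word hints."""
        parts = [f"Translate from {src_lang} to {tgt_lang}."]
        # Few-shot demonstrations: parallel sentence pairs.
        for src, tgt in examples:
            parts.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
        # Dictionary control hints, added only for words in the input.
        hints = [f'"{w}" -> "{t}"' for w, t in dictionary.items()
                 if w.lower() in source.lower()]
        if hints:
            parts.append("Word hints: " + "; ".join(hints))
        parts.append(f"{src_lang}: {source}\n{tgt_lang}:")
        return "\n\n".join(parts)

    # Toy usage; the resulting string would be sent to an LLM.
    prompt = build_mt_prompt(
        "The farmer sold the maize.",
        examples=[("The child is reading.", "Mtoto anasoma.")],
        dictionary={"maize": "mahindi", "farmer": "mkulima"},
    )
    print(prompt)

The resulting string would be passed to any instruction-following LLM; the
papers above differ mainly in which resources (dictionaries, parallel
examples, grammars) they include and how they select them.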