Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and
Spanish-Mixtec
- URL: http://arxiv.org/abs/2305.17404v1
- Date: Sat, 27 May 2023 08:03:44 GMT
- Title: Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and
Spanish-Mixtec
- Authors: Atnafu Lambebo Tonja, Christian Maldonado-Sifuentes, David Alejandro
Mendoza Castillo, Olga Kolesnikova, Noé Castro-Sánchez, Grigori Sidorov,
Alexander Gelbukh
- Abstract summary: We present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks.
We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models.
The findings show that the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) affects translation performance and that indigenous languages work better when used as target languages.
- Score: 51.35013619649463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec
corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two
indigenous Mexican languages. We evaluated the usability of the collected
corpus using three different approaches: transformer, transfer learning, and
fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook
M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09
and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively,
and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations,
respectively. The findings show that the dataset size (9,799 sentences in
Mazatec and 13,235 sentences in Mixtec) affects translation performance and
that indigenous languages work better when used as target languages. The
findings emphasize the importance of creating parallel corpora for indigenous
languages and fine-tuning models for low-resource translation tasks. Future
research will investigate zero-shot and few-shot learning approaches to further
improve translation performance in low-resource settings. The dataset and
scripts are available at
https://github.com/atnafuatx/Machine-Translation-Resources
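The abstract reports that fine-tuning a pre-trained multilingual model (Facebook's M2M100) gave the best results. As a rough illustration only, the sketch below shows how such a fine-tuning run could be set up with Hugging Face transformers. The checkpoint name (facebook/m2m100_418M), the file es_mazatec.tsv, the hyperparameters, and the reuse of an existing language code for Mazatec are assumptions for illustration, not details taken from the paper or its released scripts.

```python
# Minimal sketch, not the authors' released training script: fine-tuning an
# M2M100 checkpoint on Spanish->Mazatec sentence pairs with Hugging Face
# transformers. Checkpoint, data path, language-code choice, and
# hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/m2m100_418M"  # assumed checkpoint
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Mazatec is not among M2M100's pre-trained languages, so an existing
# language code is reused here purely for illustration.
tokenizer.src_lang = "es"
tokenizer.tgt_lang = "es"

# Hypothetical tab-separated file with "source" (Spanish) and
# "target" (Mazatec) columns holding the parallel sentences.
dataset = load_dataset("csv", data_files="es_mazatec.tsv", delimiter="\t")

def preprocess(batch):
    # Tokenize source and target sides jointly for seq2seq training.
    return tokenizer(
        batch["source"],
        text_target=batch["target"],
        truncation=True,
        max_length=128,
    )

tokenized = dataset["train"].map(
    preprocess, batched=True, remove_columns=["source", "target"]
)

args = Seq2SeqTrainingArguments(
    output_dir="m2m100-es-mazatec",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    save_total_limit=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

Running an analogous job in each direction (Spanish-Mazatec, Mazatec-Spanish, and the Mixtec pairs) and scoring with BLEU would then mirror the kind of comparison reported above.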
Related papers
- Multilingual Transfer and Domain Adaptation for Low-Resource Languages of Spain [9.28989997114014]
We participated in three translation tasks: Spanish to Aragonese (es-arg), Spanish to Aranese (es-arn), and Spanish to Asturian (es-ast).
For these three translation tasks, we apply training strategies such as multilingual transfer, regularized dropout, forward and back translation, LaBSE denoising, and ensemble learning to a neural machine translation (NMT) model based on the deep Transformer-big architecture.
arXiv Detail & Related papers (2024-09-24T09:46:27Z)
- Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning [9.373815852241648]
We employ two distinct knowledge transfer strategies to develop a reliable machine translation system for low-resource Indian languages.
For Assamese (as) and Manipuri (mn), we fine-tuned the existing IndicTrans2 open-source model to enable bidirectional translation between English and these languages.
For Khasi (kh) and Mizo (mz), we trained a multilingual model as a baseline using bilingual data from these four language pairs, along with about 8kw of additional English-Bengali bilingual data.
arXiv Detail & Related papers (2024-09-24T08:53:19Z)
- Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language [1.1702440973773898]
This study explores the use of large language models for translating English into Mambai, a low-resource Austronesian language spoken in Timor-Leste.
Our methodology involves the strategic selection of parallel sentences and dictionary entries for prompting.
We find that including dictionary entries in prompts and a mix of sentences retrieved through TF-IDF and semantic embeddings significantly improves translation quality (see the sketch after this list).
arXiv Detail & Related papers (2024-04-07T05:04:38Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
- UPB at IberLEF-2023 AuTexTification: Detection of Machine-Generated Text using Transformer Ensembles [0.5324802812881543]
This paper describes the solutions submitted by the UPB team to the AuTexTification shared task, featured as part of IberLEF-2023.
Our best-performing model achieved macro F1-scores of 66.63% on the English dataset and 67.10% on the Spanish dataset.
arXiv Detail & Related papers (2023-08-02T20:08:59Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models [57.10972566048735]
We present the system descriptions for three methods.
We used two multilingual models, namely M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki NLP Spanish-English translation model.
We experimented with 11 languages of the Americas and report the setups we used as well as the results we achieved.
arXiv Detail & Related papers (2023-05-27T08:10:40Z)
- Facebook AI WMT21 News Translation Task Submission [23.69817809546458]
We describe Facebook's multilingual model submission to the WMT2021 shared task on news translation.
We participate in 14 language directions: English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese.
We utilize data from all available sources to create high quality bilingual and multilingual baselines.
arXiv Detail & Related papers (2021-08-06T18:26:38Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
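Referring back to the retrieval-augmented prompting entry above (the Mambai study), the following is a minimal sketch of how few-shot examples could be selected by TF-IDF similarity when building a translation prompt. The corpus placeholders, the select_examples and build_prompt helpers, and the prompt layout are hypothetical illustrations rather than that study's released code; the semantic-embedding retrieval it also uses is omitted here.

```python
# Minimal sketch (assumed, not the Mambai paper's code): choose few-shot
# examples for an MT prompt by TF-IDF cosine similarity to the input.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical parallel corpus of (source, target) sentence pairs;
# placeholders stand in for real English-Mambai data.
parallel = [
    ("<English sentence 1>", "<Mambai sentence 1>"),
    ("<English sentence 2>", "<Mambai sentence 2>"),
    ("<English sentence 3>", "<Mambai sentence 3>"),
]

def select_examples(query: str, k: int = 2):
    """Return the k pairs whose source side is most similar to the
    query under TF-IDF cosine similarity."""
    sources = [src for src, _ in parallel]
    matrix = TfidfVectorizer().fit_transform(sources + [query])
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [parallel[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(query: str, dictionary_entries: list[str]) -> str:
    """Assemble a translation prompt from dictionary entries and
    retrieved parallel examples (the layout is an assumption)."""
    lines = ["Translate from English to Mambai.", "", "Dictionary:"]
    lines += dictionary_entries + [""]
    for src, tgt in select_examples(query):
        lines.append(f"English: {src}\nMambai: {tgt}")
    lines.append(f"English: {query}\nMambai:")
    return "\n".join(lines)

# Example usage with placeholder inputs.
print(build_prompt("<new English sentence>", ["<headword>: <gloss>"]))
```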