Replicable Benchmarking of Neural Machine Translation (NMT) on
Low-Resource Local Languages in Indonesia
- URL: http://arxiv.org/abs/2311.00998v1
- Date: Thu, 2 Nov 2023 05:27:48 GMT
- Authors: Lucky Susanto, Ryandito Diandaru, Adila Krisnadhi, Ayu Purwarianti,
Derry Wijaya
- Abstract summary: This study comprehensively analyzes training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese.
Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performances.
- Score: 4.634142034755327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural machine translation (NMT) for low-resource local languages in
Indonesia faces significant challenges, including the need for a representative
benchmark and limited data availability. This work addresses these challenges
by comprehensively analyzing training NMT systems for four low-resource local
languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese. Our
study encompasses various training approaches, paradigms, data sizes, and a
preliminary study into using large language models for synthetic low-resource
languages parallel data generation. We reveal specific trends and insights into
practical strategies for low-resource language translation. Our research
demonstrates that despite limited computational resources and textual data,
several of our NMT systems achieve competitive performances, rivaling the
translation quality of zero-shot gpt-3.5-turbo. These findings significantly
advance NMT for low-resource languages, offering valuable guidance for
researchers in similar contexts.
Related papers
- Relevance-guided Neural Machine Translation [5.691028372215281]
We propose an explainability-based training approach for Neural Machine Translation (NMT).
Our results show our method can be promising, particularly when training in low-resource conditions.
arXiv Detail & Related papers (2023-11-30T21:52:02Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, having a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions, the product of increased interest from the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Neural Machine Translation For Low Resource Languages [0.0]
This paper investigates the realm of low resource languages and builds a Neural Machine Translation model to achieve state-of-the-art results.
The paper looks to build upon the mBART language model and explore strategies to augment it with various NLP and Deep Learning techniques.
arXiv Detail & Related papers (2023-04-16T19:27:48Z)
- NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [100.59889279607432]
We focus on developing resources for languages in Indonesia.
Most languages in Indonesia are categorized as endangered and some are even extinct.
We develop the first-ever parallel resource for 10 low-resource languages in Indonesia.
arXiv Detail & Related papers (2022-05-31T17:03:50Z)
- Towards Better Chinese-centric Neural Machine Translation for Low-resource Languages [12.374365655284342]
Building a neural machine translation (NMT) system has become an urgent trend, especially in the low-resource setting.
Recent work tends to study NMT systems for low-resource languages centered on English, while few works focus on low-resource NMT systems centered on other languages such as Chinese.
We present the winning competition system, which leverages monolingual word-embedding data enhancement, bilingual curriculum learning, and contrastive re-ranking.
arXiv Detail & Related papers (2022-04-09T01:05:37Z)
- A Survey on Low-Resource Neural Machine Translation [106.51056217748388]
We classify related works into three categories according to the auxiliary data they used.
We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms.
arXiv Detail & Related papers (2021-07-09T06:26:38Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data [40.11208706647032]
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages.
In this work, we exploit the linguistic overlap between related languages to facilitate translating to and from a low-resource language with only monolingual data.
Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation.
arXiv Detail & Related papers (2021-05-31T16:01:18Z)
- Low-Resource Adaptation of Neural NLP Models [0.30458514384586405]
This thesis investigates methods for dealing with low-resource scenarios in information extraction and natural language understanding.
We develop and adapt neural NLP models to explore a number of research questions concerning NLP tasks with minimal or no training data.
arXiv Detail & Related papers (2020-11-09T12:13:55Z)