A Survey on Low-Resource Neural Machine Translation
- URL: http://arxiv.org/abs/2107.04239v1
- Date: Fri, 9 Jul 2021 06:26:38 GMT
- Title: A Survey on Low-Resource Neural Machine Translation
- Authors: Rui Wang and Xu Tan and Renqian Luo and Tao Qin and Tie-Yan Liu
- Abstract summary: We classify related works into three categories according to the auxiliary data they used.
We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms.
- Score: 106.51056217748388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural approaches have achieved state-of-the-art accuracy on machine translation but suffer from the high cost of collecting large-scale parallel data. Consequently, a great deal of research has been conducted on neural machine translation (NMT) with very limited parallel data, i.e., the low-resource setting. In this paper, we provide a survey of low-resource NMT and classify related works into three categories according to the auxiliary data they use: (1) exploiting monolingual data of the source and/or target languages, (2) exploiting data from auxiliary languages, and (3) exploiting multi-modal data. We hope that our survey helps researchers better understand this field and inspires them to design better algorithms, and helps industry practitioners choose appropriate algorithms for their applications.
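Of the three categories, (1) is most often realized through back-translation: synthesizing source sentences from target-language monolingual text with a reverse-direction model. A minimal sketch, where the `reverse_model.translate` interface is an assumption standing in for any trained target-to-source system:

```python
# Back-translation sketch: synthesize source sentences from target-language
# monolingual text, then mix the synthetic pairs with real parallel data.
# `reverse_model` is an assumption: any target-to-source NMT model exposing
# translate(sentences) -> list[str] would do.

def back_translate(reverse_model, target_monolingual, real_parallel):
    """Return an augmented training set of (source, target) pairs."""
    synthetic_sources = reverse_model.translate(target_monolingual)
    synthetic_pairs = list(zip(synthetic_sources, target_monolingual))
    # Synthetic pairs are noisier than real ones; common practice is to
    # tag or upsample the real data rather than mix 1:1 blindly.
    return real_parallel + synthetic_pairs
```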
Related papers
- Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages [5.376127198656944]
We compare three dataset creation strategies: (1) LLM-assisted dataset generation, (2) machine translation, and (3) human-written data by native speakers, to build a culturally nuanced story comprehension dataset.
Our findings indicate that LLM-assisted data creation outperforms machine translation.
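As a rough illustration of the LLM-assisted strategy, generation can be driven by a templated prompt over cultural contexts; `llm_generate` below is a stand-in for any text-generation call, not a specific API, and the template is purely illustrative:

```python
# Sketch of LLM-assisted dataset generation: prompt a model with a cultural
# premise and collect story/question pairs. `llm_generate` is a hypothetical
# callable, not a specific library API.

PROMPT_TEMPLATE = (
    "Write a short story in {language} set around {cultural_context}. "
    "Then write three comprehension questions about the story."
)

def build_examples(llm_generate, language, contexts):
    examples = []
    for ctx in contexts:
        prompt = PROMPT_TEMPLATE.format(language=language, cultural_context=ctx)
        examples.append({"context": ctx, "text": llm_generate(prompt)})
    return examples
```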
arXiv Detail & Related papers (2025-02-18T15:14:58Z)
- Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu [53.437954702561065]
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT.
This study systematically investigates how each resource and its quality affect translation performance, using the Manchu language as a case study.
Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help.
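As a rough sketch of how such resources are combined into an in-context prompt (the layout below is an illustrative assumption, not the paper's exact template):

```python
# Sketch: assemble an in-context MT prompt from a bilingual dictionary and a
# few parallel examples, as studied for Manchu. The prompt format here is an
# assumption for illustration only.

def build_prompt(source_sentence, dictionary, parallel_examples):
    lines = ["Translate from Manchu to English."]
    lines.append("Relevant dictionary entries:")
    for word in source_sentence.split():
        if word in dictionary:
            lines.append(f"  {word} = {dictionary[word]}")
    lines.append("Examples:")
    for src, tgt in parallel_examples:
        lines.append(f"  Manchu: {src}\n  English: {tgt}")
    lines.append(f"Manchu: {source_sentence}\nEnglish:")
    return "\n".join(lines)
```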
arXiv Detail & Related papers (2025-02-17T14:53:49Z)
- Importance-Aware Data Augmentation for Document-Level Neural Machine Translation [51.74178767827934]
Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive.
Due to its longer input length and limited availability of training data, DocNMT often faces the challenge of data sparsity.
We propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norm of hidden states and training gradients.
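The importance signal can be sketched in PyTorch as follows, assuming the encoder hidden states and token embeddings are available and that `loss` back-propagates through the embeddings; the combination rule here is a simplification, not the paper's exact formula:

```python
import torch

# IADA-style token importance sketch: score each token by the norm of its
# encoder hidden state and of the loss gradient w.r.t. its embedding.
# High-scoring tokens can then be targeted for augmentation (e.g. masking).

def token_importance(hidden_states, embeddings, loss):
    # hidden_states, embeddings: (seq_len, d_model); embeddings must require
    # grad and feed into the computation of `loss`.
    grads, = torch.autograd.grad(loss, embeddings, retain_graph=True)
    state_norm = hidden_states.norm(dim=-1)  # activation-based signal
    grad_norm = grads.norm(dim=-1)           # gradient-based signal
    return state_norm * grad_norm            # higher = more important
```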
arXiv Detail & Related papers (2024-01-27T09:27:47Z)
- Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia [4.634142034755327]
This study comprehensively analyzes training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese.
Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performance.
arXiv Detail & Related papers (2023-11-02T05:27:48Z)
- Textual Augmentation Techniques Applied to Low Resource Machine Translation: Case of Swahili [1.9686054517684888]
In machine translation, the majority of the world's language pairs are considered low-resource because little parallel data is available for them.
We study and apply three simple data augmentation techniques popularly used in text classification tasks.
We see potential for these methods in neural machine translation, provided more extensive experiments are conducted with diverse datasets.
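For concreteness, classic text-classification augmentation operations include random swap and random deletion; whether these match the paper's three techniques is an assumption, so the sketch below only illustrates the style of method involved:

```python
import random

# Simple augmentations applied to the source side of a parallel corpus.
# Whether these correspond to the paper's three techniques is an assumption.

def random_swap(tokens, n=1):
    """Swap n random token pairs."""
    tokens = tokens[:]
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, never returning an empty sentence."""
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens
```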
arXiv Detail & Related papers (2023-06-12T20:43:24Z)
- Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages [0.6833698896122186]
Limited access to high computational resources compounds the issue of data scarcity for African languages.
We evaluate language adapters as a cost-effective approach to low-resource African NLP.
This opens the door to further experimentation with, and exploration of, the full extent of language adapters' capacities.
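A language adapter is typically a small bottleneck module inserted into a frozen pretrained model, so only a few parameters are trained per language. A minimal PyTorch sketch, with illustrative dimensions:

```python
import torch.nn as nn

# Bottleneck adapter: down-project, apply a nonlinearity, up-project, and add
# a residual connection. Only these parameters are trained per language while
# the host model stays frozen. Dimensions are illustrative.

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden):
        # The residual keeps the frozen model's behavior recoverable.
        return hidden + self.up(self.act(self.down(hidden)))
```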
arXiv Detail & Related papers (2023-03-29T19:25:43Z)
- Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation [53.342921374639346]
Multimodal machine translation aims to improve translation quality by incorporating information from other modalities, such as vision.
Previous MMT systems mainly focus on better access and use of visual information and tend to validate their methods on image-related datasets.
This paper establishes new methods and new datasets for MMT.
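One common fusion strategy, shown here only to illustrate the general idea (not this paper's specific method), projects a single image embedding and prepends it to the text encoder states so the decoder can attend to it:

```python
import torch
import torch.nn as nn

# Minimal MMT fusion sketch: prepend a projected image feature to the encoder
# states. Feature dimensions are illustrative assumptions.

class VisualPrefixFusion(nn.Module):
    def __init__(self, d_image=2048, d_model=512):
        super().__init__()
        self.proj = nn.Linear(d_image, d_model)

    def forward(self, image_feat, encoder_states):
        # image_feat: (batch, d_image); encoder_states: (batch, seq, d_model)
        prefix = self.proj(image_feat).unsqueeze(1)  # (batch, 1, d_model)
        return torch.cat([prefix, encoder_states], dim=1)
```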
arXiv Detail & Related papers (2022-12-20T15:02:38Z)
- Exploiting Neural Query Translation into Cross Lingual Information Retrieval [49.167049709403166]
Existing cross-lingual information retrieval (CLIR) systems mainly exploit statistical machine translation (SMT) rather than more advanced neural machine translation (NMT).
We propose a novel data augmentation method that extracts query translation pairs according to user clickthrough data.
Experimental results reveal that the proposed approach yields better retrieval quality than strong baselines.
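The mining step can be sketched as follows; the log schema and the click-count filter are illustrative assumptions, not the paper's exact procedure:

```python
# Sketch of mining query-translation pairs from clickthrough logs: when a
# query in one language repeatedly leads to clicks on a document in another,
# treat the clicked document's title as a candidate translation.

def mine_pairs(click_log, min_clicks=5):
    """click_log: iterable of dicts like {"query", "title", "clicks"}."""
    pairs = []
    for record in click_log:
        # Keep only pairs with enough click evidence to be trustworthy.
        if record["clicks"] >= min_clicks:
            pairs.append((record["query"], record["title"]))
    return pairs
```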
arXiv Detail & Related papers (2020-10-26T15:28:19Z)
- Multi-task Learning for Multilingual Neural Machine Translation [32.81785430242313]
We propose a multi-task learning framework that jointly trains the model with the translation task on bitext data and two denoising tasks on the monolingual data.
We show that the proposed approach can effectively improve the translation quality for both high-resource and low-resource languages.
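A sketch of the combined objective, assuming a `model` that exposes per-task loss methods; the task names and the weighting are illustrative assumptions:

```python
# Joint objective sketch: supervised translation loss on bitext plus two
# denoising losses on monolingual data. The interface and weights below are
# assumptions for illustration.

def multitask_loss(model, bitext_batch, mono_batch, w_denoise=1.0):
    loss_mt = model.translation_loss(bitext_batch)            # parallel data
    loss_dae = model.denoising_loss(mono_batch, task="dae")   # reconstruct corrupted input
    loss_msk = model.denoising_loss(mono_batch, task="mask")  # predict masked spans
    return loss_mt + w_denoise * (loss_dae + loss_msk)
```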
arXiv Detail & Related papers (2020-10-06T06:54:12Z)
- Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation [54.52971020087777]
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models.
Self-supervision improves zero-shot translation quality in multilingual models.
We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
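BLEU figures like the 33 reported for ro-en are conventionally computed at corpus level, e.g. with sacrebleu (the sentences below are toy placeholders):

```python
import sacrebleu

# Corpus-level BLEU as conventionally reported for MT. sacrebleu expects a
# list of hypothesis strings and a list of reference streams, each a list of
# strings parallel to the hypotheses.

hypotheses = ["the cat sat on the mat"]
references = [["the cat sat on the mat"]]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 1))  # 100.0 for an exact match
```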
arXiv Detail & Related papers (2020-05-11T00:20:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.