Towards Better Chinese-centric Neural Machine Translation for
Low-resource Languages
- URL: http://arxiv.org/abs/2204.04344v1
- Date: Sat, 9 Apr 2022 01:05:37 GMT
- Title: Towards Better Chinese-centric Neural Machine Translation for
Low-resource Languages
- Authors: Bin Li, Yixuan Weng, Fei Xia, Hanjun Deng
- Abstract summary: Building a neural machine translation (NMT) system has become an urgent trend, especially in the low-resource setting.
Recent work tends to study NMT systems for low-resource languages centered on English, while few works focus on low-resource NMT systems centered on other languages such as Chinese.
We present the winning competition system, which leverages data enhancement based on monolingual word embeddings, bilingual curriculum learning, and contrastive re-ranking.
- Score: 12.374365655284342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The last decade has witnessed enormous improvements in science and
technology, stimulating the growing demand for economic and cultural exchanges
in various countries. Building a neural machine translation (NMT) system has
become an urgent trend, especially in the low-resource setting. However, recent
work tends to study NMT systems for low-resource languages centered on English,
while few works focus on low-resource NMT systems centered on other languages
such as Chinese. To fill this gap, the low-resource multilingual translation
challenge of the 2021 iFLYTEK AI Developer Competition provides
Chinese-centric multilingual low-resource NMT tasks, where participants are
required to build NMT systems from the provided low-resource samples. In
this paper, we present the winning competition system, which leverages data
enhancement based on monolingual word embeddings, bilingual curriculum
learning, and contrastive re-ranking.
contrastive re-ranking. In addition, a new Incomplete-Trust (In-trust) loss
function is proposed to replace the traditional cross-entropy loss when
training. The experimental results demonstrate that implementing these ideas
leads to better performance than other state-of-the-art methods. All the
experimental code is released at:
https://github.com/WENGSYX/Low-resource-text-translation.
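To make the In-trust idea concrete: the loss mixes standard cross-entropy with a "distrust" term that partially trusts the model's own predictions, which makes training more robust to noisy parallel data. Below is a minimal PyTorch sketch assuming the common formulation L = alpha*CE + beta*DCE with DCE = -p*log(delta*p + (1-delta)*q); the hyperparameter values are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn.functional as F

def in_trust_loss(logits, targets, alpha=1.0, beta=1.0, delta=0.5):
    """Incomplete-trust (In-trust) style loss: cross-entropy plus a
    'distrust' term mixing the model prediction p with the (possibly
    noisy) one-hot label q. Hyperparameters here are illustrative."""
    ce = F.cross_entropy(logits, targets)                  # standard CE term
    p = F.softmax(logits, dim=-1)                          # model distribution
    q = F.one_hot(targets, num_classes=logits.size(-1)).float()
    # DCE = -sum_v p * log(delta * p + (1 - delta) * q), averaged over batch
    dce = -(p * torch.log(delta * p + (1.0 - delta) * q + 1e-8)).sum(-1).mean()
    return alpha * ce + beta * dce

# Usage: token-level logits flattened to (batch * seq_len, vocab_size).
logits = torch.randn(8, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (8,))
in_trust_loss(logits, targets).backward()
```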
Related papers
- Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study [1.6819960041696331]
In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian.
Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance.
Statistical significance tests with Bonferroni correction show that the baseline systems are surprisingly strong and that Back-translation leads to significant improvement.
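For readers unfamiliar with the technique, back-translation generates synthetic training pairs by translating target-side monolingual text back into the source language with a reverse-direction model. The sketch below shows only the data flow; the `translate` and `train_on` helpers are hypothetical stand-ins, not the paper's actual code.

```python
# Minimal back-translation data flow; model helpers are hypothetical.

def back_translate(reverse_model, target_monolingual):
    """Create synthetic (source, target) pairs from target-language
    monolingual sentences via a target->source model."""
    pairs = []
    for tgt in target_monolingual:
        src = reverse_model.translate(tgt)   # hypothetical target->source call
        pairs.append((src, tgt))             # noisy source, clean target
    return pairs

def train_with_back_translation(forward_model, reverse_model,
                                parallel_data, target_monolingual):
    synthetic = back_translate(reverse_model, target_monolingual)
    # Mix real and synthetic pairs; the clean target side keeps the
    # supervision signal reliable even when sources are noisy.
    forward_model.train_on(parallel_data + synthetic)
    return forward_model
```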
arXiv Detail & Related papers (2024-04-12T06:16:26Z)
- Relevance-guided Neural Machine Translation [5.691028372215281]
We propose an explainability-based training approach for Neural Machine Translation (NMT)
Our results show our method can be promising, particularly when training in low-resource conditions.
arXiv Detail & Related papers (2023-11-30T21:52:02Z)
- Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process that mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z)
- Replicable Benchmarking of Neural Machine Translation (NMT) on Low-Resource Local Languages in Indonesia [4.634142034755327]
This study comprehensively analyzes training NMT systems for four low-resource local languages in Indonesia: Javanese, Sundanese, Minangkabau, and Balinese.
Our research demonstrates that despite limited computational resources and textual data, several of our NMT systems achieve competitive performances.
arXiv Detail & Related papers (2023-11-02T05:27:48Z)
- Boosting Unsupervised Machine Translation with Pseudo-Parallel Data [2.900810893770134]
We propose a training strategy that relies on pseudo-parallel sentence pairs mined from monolingual corpora and synthetic sentence pairs back-translated from monolingual corpora.
We reach an improvement of up to 14.5 BLEU points (English to Ukrainian) over a baseline trained on back-translated data only.
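One plausible way to mine such pseudo-parallel pairs (an assumption for illustration, not necessarily the authors' exact procedure) is to embed monolingual sentences from both languages with a multilingual sentence encoder and keep nearest cross-lingual neighbors above a similarity threshold:

```python
import numpy as np

def mine_pseudo_parallel(src_sents, tgt_sents, src_emb, tgt_emb, threshold=0.8):
    """Pair each source sentence with its most similar target sentence by
    cosine similarity; embeddings are assumed precomputed by a multilingual
    sentence encoder."""
    # Normalize so the dot product equals cosine similarity.
    src_emb = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt_emb = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src_emb @ tgt_emb.T               # (n_src, n_tgt) cosine similarities
    best = sims.argmax(axis=1)               # nearest target for each source
    return [(src_sents[i], tgt_sents[j])
            for i, j in enumerate(best) if sims[i, j] >= threshold]
```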
arXiv Detail & Related papers (2023-10-22T10:57:12Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
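The two-stage recipe can be pictured as a decoder with shared lower layers and per-language top layers routed by a target-language id. The sketch below is an assumed simplification of that architecture, not the released HLT-MT implementation.

```python
import torch.nn as nn

class LanguageSpecificDecoderTop(nn.Module):
    """Shared lower decoder layers plus per-language top layers selected
    by target-language id (simplified HLT-MT-style routing)."""
    def __init__(self, d_model, n_shared, n_specific, n_langs, nhead=8):
        super().__init__()
        make = lambda: nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.shared = nn.ModuleList([make() for _ in range(n_shared)])
        # One stack of language-specific top layers per target language.
        self.specific = nn.ModuleList([
            nn.ModuleList([make() for _ in range(n_specific)])
            for _ in range(n_langs)])

    def forward(self, x, memory, lang_id):
        for layer in self.shared:             # shared across all languages
            x = layer(x, memory)
        for layer in self.specific[lang_id]:  # routed by target language
            x = layer(x, memory)
        return x
```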
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
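One crude way to realize such an adjacency region (a hedged stand-in for the paper's learned semantic region, offered only to make the idea concrete) is to sample small Gaussian perturbations around a sentence's encoder representation:

```python
import torch

def augment_in_adjacency_region(sentence_repr, radius=0.1, n_samples=4):
    """Sample nearby variants of a sentence representation from a small
    Gaussian ball around it -- a rough stand-in for an adjacency region
    of semantically equivalent expressions."""
    noise = torch.randn(n_samples, *sentence_repr.shape) * radius
    return sentence_repr.unsqueeze(0) + noise      # (n_samples, seq_len, d)

# Usage: turn one (seq_len, d_model) encoder output into 4 nearby variants.
variants = augment_in_adjacency_region(torch.randn(16, 512))  # (4, 16, 512)
```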
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language [0.015863809575305417]
We introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training.
We compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.
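A simple approximation of super-word tokenization (greedy longest match over a curated expression vocabulary; the actual WEB strategy additionally involves human annotators) looks like this:

```python
def web_tokenize(sentence, expression_vocab):
    """Greedy longest-match tokenization against a curated vocabulary of
    multi-word expressions; unmatched spans fall back to single words."""
    words = sentence.split()
    tokens, i = [], 0
    while i < len(words):
        for j in range(len(words), i + 1, -1):     # longest candidates first
            if " ".join(words[i:j]) in expression_vocab:
                tokens.append(" ".join(words[i:j]))
                i = j
                break
        else:
            tokens.append(words[i])                # single-word fallback
            i += 1
    return tokens

# Usage with a tiny hypothetical expression vocabulary:
vocab = {"machine translation", "low resource"}
print(web_tokenize("low resource machine translation systems", vocab))
# -> ['low resource', 'machine translation', 'systems']
```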
arXiv Detail & Related papers (2021-03-14T22:12:14Z)
- Neural Machine Translation: Challenges, Progress and Future [62.75523637241876]
Machine translation (MT) is a technique that leverages computers to translate human languages automatically.
Neural machine translation (NMT) models the direct mapping between source and target languages with deep neural networks.
This article reviews the NMT framework, discusses the challenges in NMT, and introduces some exciting recent progress.
arXiv Detail & Related papers (2020-04-13T07:53:57Z)
- Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation [49.51278300110449]
We propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the languages of interest.
A case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora.
arXiv Detail & Related papers (2020-01-23T02:47:39Z)