Mergen: The First Manchu-Korean Machine Translation Model Trained on
Augmented Data
- URL: http://arxiv.org/abs/2311.17492v2
- Date: Fri, 12 Jan 2024 14:18:03 GMT
- Title: Mergen: The First Manchu-Korean Machine Translation Model Trained on
Augmented Data
- Authors: Jean Seo, Sungjoo Byun, Minha Kang, Sangah Lee
- Abstract summary: We introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation model.
Due to the scarcity of a Manchu-Korean parallel dataset, we expand our data by employing word replacement guided by GloVe embeddings.
Experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Manchu language, with its roots in the historical Manchurian region of
Northeast China, is now facing a critical threat of extinction, as there are
very few speakers left. In our efforts to safeguard the Manchu language, we
introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation
(MT) model. To develop this model, we utilize valuable resources such as the
Manwen Laodang (a historical book) and a Manchu-Korean dictionary. Due to the
scarcity of a Manchu-Korean parallel dataset, we expand our data by employing
word replacement guided by GloVe embeddings, trained on both monolingual and
parallel texts. Our approach is built around an encoder-decoder neural machine
translation model, incorporating a bi-directional Gated Recurrent Unit (GRU)
layer. The experiments have yielded promising results, showcasing a significant
enhancement in Manchu-Korean translation, with a remarkable 20-30 point
increase in the BLEU score.
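
The word-replacement augmentation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional vectors are made-up stand-ins for trained GloVe embeddings, the English tokens are placeholders for Manchu words, and the names `nearest_neighbor`, `augment`, and the `min_similarity` threshold are hypothetical. The sketch only shows the general idea of swapping a word for its closest embedding neighbor to create extra training sentences; it does not address how the Korean side of each pair is kept consistent.

```python
import math

# Toy 2-d vectors standing in for trained GloVe embeddings (made-up values).
EMBEDDINGS = {
    "king": [0.9, 0.1],
    "ruler": [0.88, 0.15],
    "river": [0.1, 0.9],
    "stream": [0.12, 0.85],
    "went": [0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(word, embeddings, min_similarity=0.95):
    """Return the most similar other word, or None if none clears the threshold."""
    if word not in embeddings:
        return None
    best_word, best_sim = None, min_similarity
    for other, vec in embeddings.items():
        if other == word:
            continue
        sim = cosine(embeddings[word], vec)
        if sim >= best_sim:
            best_word, best_sim = other, sim
    return best_word


def augment(tokens, embeddings, min_similarity=0.95):
    """Create one new sentence per token that has a close enough neighbor."""
    variants = []
    for i, tok in enumerate(tokens):
        substitute = nearest_neighbor(tok, embeddings, min_similarity)
        if substitute is not None:
            variants.append(tokens[:i] + [substitute] + tokens[i + 1:])
    return variants


if __name__ == "__main__":
    for sentence in augment(["king", "went", "river"], EMBEDDINGS):
        print(" ".join(sentence))
```

The similarity threshold is a design choice of this sketch: a high value keeps substitutions close in meaning at the cost of producing fewer new sentences.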
Related papers
- A Tulu Resource for Machine Translation [3.038642416291856]
We present the first parallel dataset for English-Tulu translation.
Tulu is spoken by approximately 2.5 million individuals in southwestern India.
Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points.
arXiv Detail & Related papers (2024-03-28T04:30:07Z)
- Bridging the Data Gap between Training and Inference for Unsupervised
Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data with translated source sentences, but translates natural source sentences at inference.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach, which simultaneously uses the pseudo-parallel data {natural source, translated target} to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- Learning and Analyzing Generation Order for Undirected Sequence Models [86.10875837475783]
We train a policy that learns the generation order for a pre-trained, undirected translation model via reinforcement learning.
We show that the translations by our learned orders achieve higher BLEU scores than the outputs decoded from left to right or decoded by the learned order from Mansimov et al.
Our findings could provide more insights on the mechanism of undirected generation models and encourage further research in this direction.
arXiv Detail & Related papers (2021-12-16T18:29:07Z)
- Improvement in Machine Translation with Generative Adversarial Networks [0.9612136532344103]
We take inspiration from RelGAN, a model for text generation, and NMT-GAN, an adversarial machine translation model, to implement a model that learns to transform awkward, non-fluent English sentences to fluent ones.
We utilize a parameter $\lambda$ to control the amount of deviation from the input sentence, i.e. a trade-off between keeping the original tokens and modifying them to be more fluent.
arXiv Detail & Related papers (2021-11-30T06:51:13Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality
Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- Source and Target Bidirectional Knowledge Distillation for End-to-end
Speech Translation [88.78138830698173]
We focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models.
We train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder.
arXiv Detail & Related papers (2021-04-13T19:00:51Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Towards Computational Linguistics in Minangkabau Language: Studies on
Sentiment Analysis and Machine Translation [5.381004207943597]
We release two Minangkabau corpora, one for sentiment analysis and one for machine translation, harvested and constructed from Twitter and Wikipedia.
We conduct the first computational linguistics study of the Minangkabau language, employing classic machine learning and sequence-to-sequence models such as LSTM and Transformer.
arXiv Detail & Related papers (2020-09-19T22:13:27Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora
for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- HausaMT v1.0: Towards English-Hausa Neural Machine Translation [0.012691047660244334]
We build a baseline model for English-Hausa machine translation.
The Hausa language is the second largest Afro-Asiatic language in the world after Arabic.
arXiv Detail & Related papers (2020-06-09T02:08:03Z)
- Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised
Neural Machine Translation [5.958653653305609]
We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences.
This automatically expands the vocabulary of the model while maintaining high quality content.
arXiv Detail & Related papers (2020-04-05T02:14:14Z)