Related papers: Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data

Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data

URL: http://arxiv.org/abs/2311.17492v2
Date: Fri, 12 Jan 2024 14:18:03 GMT
Title: Mergen: The First Manchu-Korean Machine Translation Model Trained on Augmented Data
Authors: Jean Seo, Sungjoo Byun, Minha Kang, Sangah Lee
Abstract summary: We introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation model. Due to the scarcity of a Manchu-Korean parallel dataset, we expand our data by employing word replacement guided by GloVe embeddings. Experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Manchu language, with its roots in the historical Manchurian region of Northeast China, is now facing a critical threat of extinction, as there are very few speakers left. In our efforts to safeguard the Manchu language, we introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation (MT) model. To develop this model, we utilize valuable resources such as the Manwen Laodang(a historical book) and a Manchu-Korean dictionary. Due to the scarcity of a Manchu-Korean parallel dataset, we expand our data by employing word replacement guided by GloVe embeddings, trained on both monolingual and parallel texts. Our approach is built around an encoder-decoder neural machine translation model, incorporating a bi-directional Gated Recurrent Unit (GRU) layer. The experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation, with a remarkable 20-30 point increase in the BLEU score.

Related papers

Towards Cultural Bridge by Bahnaric-Vietnamese Translation Using Transfer Learning of Sequence-To-Sequence Pre-training Language Model [0.24578723416255754]
This work explores the journey towards achieving Bahnaric-Vietnamese translation for the sake of culturally bridging the two ethnic groups in Vietnam.<n>The most prominent challenge is the lack of available original Bahnaric resources source language.<n>We leverage a transfer learning approach using sequence-to-sequence pre-training language model.
arXiv Detail & Related papers (2025-05-16T16:33:36Z)
An Efficient Approach for Machine Translation on Low-resource Languages: A Case Study in Vietnamese-Chinese [1.6932009464531739]
We proposed an approach for machine translation in low-resource languages such as Vietnamese-Chinese. Our proposed method leveraged the power of the multilingual pre-trained language model (mBART) and both Vietnamese and Chinese monolingual corpus.
arXiv Detail & Related papers (2025-01-31T17:11:45Z)
A Tulu Resource for Machine Translation [3.038642416291856]
We present the first parallel dataset for English-Tulu translation. Tulu is spoken by approximately 2.5 million individuals in southwestern India. Our English-Tulu system, trained without using parallel English-Tulu data, outperforms Google Translate by 19 BLEU points.
arXiv Detail & Related papers (2024-03-28T04:30:07Z)
Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on the pseudo parallel data with translated source, and natural source sentences in inference. The source discrepancy between training and inference hinders the translation performance of UNMT models. We propose an online self-training approach, which simultaneously uses the pseudo parallel data natural source, translated target to mimic the inference scenario.
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
Learning and Analyzing Generation Order for Undirected Sequence Models [86.10875837475783]
We train a policy that learns the generation order for a pre-trained, undirected translation model via reinforcement learning. We show that the translations by our learned orders achieve higher BLEU scores than the outputs decoded from left to right or decoded by the learned order from Mansimov et al. Our findings could provide more insights on the mechanism of undirected generation models and encourage further research in this direction.
arXiv Detail & Related papers (2021-12-16T18:29:07Z)
Improvement in Machine Translation with Generative Adversarial Networks [0.9612136532344103]
We take inspiration from RelGAN, a model for text generation, and NMT-GAN, an adversarial machine translation model, to implement a model that learns to transform awkward, non-fluent English sentences to fluent ones. We utilize a parameter $lambda$ to control the amount of deviation from the input sentence, i.e. a trade-off between keeping the original tokens and modifying it to be more fluent.
arXiv Detail & Related papers (2021-11-30T06:51:13Z)
ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and an endangered language Cherokee. It supports both statistical and neural translation models as well as provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation [88.78138830698173]
We focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models. We train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder.
arXiv Detail & Related papers (2021-04-13T19:00:51Z)
Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting. Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking. We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation [5.381004207943597]
We release two Minangkabau corpora: sentiment analysis and machine translation that are harvested and constructed from Twitter and Wikipedia. We conduct the first computational linguistics in Minangkabau language employing classic machine learning and sequence-to-sequence models such as LSTM and Transformer.
arXiv Detail & Related papers (2020-09-19T22:13:27Z)
Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language. The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model. Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq)
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
HausaMT v1.0: Towards English-Hausa Neural Machine Translation [0.012691047660244334]
We build a baseline model for English-Hausa machine translation. The Hausa language is the second largest Afro-Asiatic language in the world after Arabic.
arXiv Detail & Related papers (2020-06-09T02:08:03Z)
Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised Neural Machine Translation [5.958653653305609]
We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences. This automatically expands the vocabulary of the model while maintaining high quality content.
arXiv Detail & Related papers (2020-04-05T02:14:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.