Learning How to Translate North Korean through South Korean
- URL: http://arxiv.org/abs/2201.11258v1
- Date: Thu, 27 Jan 2022 01:21:29 GMT
- Title: Learning How to Translate North Korean through South Korean
- Authors: Hwichan Kim, Sangwhan Moon, Naoaki Okazaki, and Mamoru Komachi
- Abstract summary: South and North Korea both use the Korean language.
Existing NLP systems of the Korean language cannot handle North Korean inputs.
We create data for North Korean NMT models using a comparable corpus.
We verify that a model trained on North Korean bilingual data without human annotation can significantly boost North Korean translation accuracy.
- Score: 24.38451366384134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: South and North Korea both use the Korean language. However, Korean NLP
research has focused on South Korean only, and existing NLP systems of the
Korean language, such as neural machine translation (NMT) models, cannot
properly handle North Korean inputs. Training a model using North Korean data
is the most straightforward approach to solving this problem, but there is
insufficient data to train NMT models. In this study, we create data for North
Korean NMT models using a comparable corpus. First, we manually create
evaluation data for automatic alignment and machine translation. Then, we
investigate automatic alignment methods suitable for North Korean. Finally, we
verify that a model trained on North Korean bilingual data without human
annotation can significantly boost North Korean translation accuracy compared
to existing South Korean models in zero-shot settings.
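As a concrete illustration of the automatic alignment step described in the abstract, the sketch below mines sentence pairs from a comparable corpus by embedding similarity. The choice of multilingual encoder (LaBSE via the sentence-transformers library), the greedy one-best matching, and the 0.8 threshold are illustrative assumptions, not the authors' actual setup; the paper itself investigates which alignment methods suit North Korean.

```python
# A minimal sketch of mining sentence pairs from a comparable corpus by
# embedding similarity. The encoder, matching scheme, and threshold are
# illustrative assumptions, not the paper's exact method.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def align_sentences(src_sents, tgt_sents, threshold=0.8):
    """Pair each source sentence with its most similar target sentence,
    keeping only pairs whose cosine similarity clears the threshold."""
    model = SentenceTransformer("sentence-transformers/LaBSE")
    src = np.asarray(model.encode(src_sents), dtype=float)
    tgt = np.asarray(model.encode(tgt_sents), dtype=float)
    src /= np.linalg.norm(src, axis=1, keepdims=True)
    tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T  # cosine similarity matrix
    pairs = []
    for i, row in enumerate(sim):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs

# Usage: feed sentences from matched comparable documents, e.g.
# pairs = align_sentences(north_korean_sentences, target_side_sentences)
```

Pairs that survive the threshold would then serve as bilingual training data for a North Korean NMT model.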
Related papers
- RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining [0.0]
We present RedWhale, a model specifically tailored for Korean language processing.
RedWhale is developed using an efficient continual pretraining approach that includes a comprehensive Korean corpus preprocessing pipeline.
Experimental results demonstrate that RedWhale outperforms other leading models on Korean NLP benchmarks.
arXiv Detail & Related papers (2024-08-21T02:49:41Z)
- Better Datastore, Better Translation: Generating Datastores from Pre-Trained Models for Nearest Neural Machine Translation [48.58899349349702]
Nearest Neighbor Machine Translation (kNN-MT) is a simple and effective method of augmenting neural machine translation (NMT) with a token-level nearest neighbor retrieval mechanism.
In this paper, we propose PRED, a framework that leverages Pre-trained models for Datastores in kNN-MT.
arXiv Detail & Related papers (2022-12-17T08:34:20Z)
- Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation [26.48270086631483]
We provide datasets that cover a wide range of Korean grammatical errors.
We then define 14 error types for Korean and provide KAGAS, which can automatically annotate error types from parallel corpora.
We show that the model trained with our datasets significantly outperforms the currently used statistical Korean GEC system (Hanspell) on a wider range of error types.
arXiv Detail & Related papers (2022-10-25T23:41:52Z)
- Towards Robust k-Nearest-Neighbor Machine Translation [72.9252395037097]
k-Nearest-Neighbor Machine Translation (kNN-MT) has become an important research direction in NMT in recent years.
Its main idea is to retrieve useful key-value pairs from an additional datastore to modify translations without updating the NMT model.
However, noisy retrieved pairs can dramatically degrade model performance.
We propose a confidence-enhanced kNN-MT model with robust training to alleviate the impact of noise.
arXiv Detail & Related papers (2022-10-17T07:43:39Z)
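The two kNN-MT entries above share the same underlying mechanism: at each decoding step, the decoder hidden state queries a token-level datastore, and the retrieved distribution is interpolated with the NMT model's own distribution without updating the model. The sketch below is a toy illustration of that interpolation step; the datastore contents, k, temperature, and interpolation weight are assumptions for illustration, not values from either paper.

```python
# Toy sketch of the kNN-MT interpolation step: retrieve the k nearest
# datastore entries for the current decoder state, turn their distances
# into a distribution over target tokens, and mix it with the NMT model's
# own distribution. All values below are illustrative only.
import numpy as np

def knn_mt_distribution(query, keys, values, p_nmt, vocab_size,
                        k=4, temperature=10.0, lam=0.5):
    # Euclidean distances from the decoder hidden state to all datastore keys.
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Softmax over negative distances of the k retrieved neighbors.
    weights = np.exp(-dists[nearest] / temperature)
    weights /= weights.sum()
    # Scatter neighbor weights onto the target-token ids stored with each key.
    p_knn = np.zeros(vocab_size)
    for w, idx in zip(weights, nearest):
        p_knn[values[idx]] += w
    # Interpolate retrieval and model distributions without updating the model.
    return lam * p_knn + (1.0 - lam) * p_nmt

# Toy usage: 8 datastore entries with 16-dim keys, vocabulary of 10 tokens.
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16))
values = rng.integers(0, 10, size=8)   # target token id stored with each key
query = rng.normal(size=16)            # current decoder hidden state
p_nmt = np.full(10, 0.1)               # model's next-token distribution
p_final = knn_mt_distribution(query, keys, values, p_nmt, vocab_size=10)
```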
- Translating Hanja Historical Documents to Contemporary Korean and English [52.625998002213585]
The Annals of the Joseon Dynasty contain the daily records of the kings of Joseon, the 500-year kingdom preceding the modern nation of Korea.
The Annals were originally written in an archaic Korean writing system, Hanja, and were translated into Korean from 1968 to 1993.
Since then, the translation of the records of only one king has been completed in a decade.
We propose H2KE, a neural machine translation model that translates historical documents in Hanja into more easily understandable Korean and into English.
arXiv Detail & Related papers (2022-05-20T08:25:11Z)
- Design of a novel Korean learning application for efficient pronunciation correction [2.008880264104061]
Speech recognition, speech-to-text, and speech-to-waveform are the three key components of the proposed system.
The software will then display the user's phrase and answer, with mispronounced elements highlighted in red.
arXiv Detail & Related papers (2022-05-04T11:19:29Z)
- Korean Tokenization for Beam Search Rescoring in Speech Recognition [13.718396242036818]
We propose a Korean tokenization method for the neural network-based language model used in Korean ASR.
The method inserts a special token, SkipTC, when a Korean syllable has no trailing consonant.
Experiments show that the proposed approach achieves a lower word error rate than the same LM without SkipTC.
arXiv Detail & Related papers (2022-02-22T11:25:01Z)
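The SkipTC entry above hinges on detecting whether a Hangul syllable ends in a trailing consonant (jongseong), which can be read directly off the Unicode code point of a precomposed syllable. The sketch below shows that check plus a naive marker insertion; the `<skip_tc>` string and the character-level loop are illustrative assumptions, not the paper's actual tokenizer.

```python
# Precomposed Hangul syllables occupy U+AC00..U+D7A3, and
# (code point - 0xAC00) % 28 == 0 means the syllable has no trailing
# consonant. The "<skip_tc>" marker is an assumed placeholder.

HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3

def has_trailing_consonant(syllable: str) -> bool:
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable")
    return (code - HANGUL_BASE) % 28 != 0

def mark_skip_tc(text: str, marker: str = "<skip_tc>") -> str:
    out = []
    for ch in text:
        out.append(ch)
        if HANGUL_BASE <= ord(ch) <= HANGUL_LAST and not has_trailing_consonant(ch):
            out.append(marker)  # syllable ends in a vowel: insert the marker
    return "".join(out)

print(mark_skip_tc("한국어"))  # '한' and '국' have trailing consonants, '어' does not
```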
- Non-Parametric Online Learning from Human Feedback for Neural Machine Translation [54.96594148572804]
We study the problem of online learning with human feedback in human-in-the-loop machine translation.
Previous methods require online model updating or additional translation memory networks to achieve high-quality performance.
We propose a novel non-parametric online learning method without changing the model structure.
arXiv Detail & Related papers (2021-09-23T04:26:15Z)
- Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT [64.1841519527504]
Neural machine translation uses a single neural network to model the entire translation process.
Although NMT is the de facto standard, it is still not clear how NMT models acquire different competences over the course of training.
arXiv Detail & Related papers (2021-09-03T09:38:50Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
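The pretraining entry above contrasts masking-based corruption with objectives whose inputs resemble real, full sentences produced by reordering or replacing words. The toy sketch below shows one example of each corruption; the mask token, span length, masking ratio, and shuffle window are arbitrary illustrative choices, not the paper's configuration.

```python
# Toy sketch of two pretraining corruptions: (a) mask spans of the input
# for the decoder to reconstruct, and (b) produce a full (unmasked) but
# locally reordered sentence. All parameters are illustrative only.
import random

MASK = "<mask>"

def mask_spans(tokens, span_len=2, ratio=0.3):
    """Replace random spans with a single mask token (MLM-style corruption)."""
    tokens = list(tokens)
    n_spans = max(1, int(len(tokens) * ratio / span_len))
    for _ in range(n_spans):
        start = random.randrange(0, max(1, len(tokens) - span_len))
        tokens[start:start + span_len] = [MASK]
    return tokens

def shuffle_locally(tokens, window=3):
    """Return the same words, locally reordered within small windows."""
    tokens, out = list(tokens), []
    for i in range(0, len(tokens), window):
        chunk = tokens[i:i + window]
        random.shuffle(chunk)
        out.extend(chunk)
    return out

sent = "we create data for north korean nmt models".split()
print(mask_spans(sent))       # e.g. ['we', '<mask>', 'for', ...]
print(shuffle_locally(sent))  # same words, locally reordered
```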
- KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding [4.576330530169462]
Natural language inference (NLI) and semantic textual similarity (STS) are key tasks in natural language understanding (NLU).
There are no publicly available NLI or STS datasets in the Korean language.
We construct and release new datasets for Korean NLI and STS, dubbed KorNLI and KorSTS, respectively.
arXiv Detail & Related papers (2020-04-07T11:49:15Z)