Translating Hanja Historical Documents to Contemporary Korean and
English
- URL: http://arxiv.org/abs/2205.10019v5
- Date: Fri, 29 Dec 2023 12:18:45 GMT
- Title: Translating Hanja Historical Documents to Contemporary Korean and
English
- Authors: Juhee Son, Jiho Jin, Haneul Yoo, JinYeong Bak, Kyunghyun Cho, Alice Oh
- Abstract summary: The Annals of Joseon Dynasty contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea.
The Annals were originally written in an archaic Korean writing system, 'Hanja', and were translated into Korean from 1968 to 1993.
A new expert translation effort began in 2012; since then, the records of only one king have been completed in a decade.
We propose H2KE, a neural machine translation model that translates historical documents in Hanja into more easily understandable Korean and into English.
- Score: 52.625998002213585
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The Annals of Joseon Dynasty (AJD) contain the daily records of the Kings of
Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals
were originally written in an archaic Korean writing system, 'Hanja', and were
translated into Korean from 1968 to 1993. However, the resulting translation was
too literal and contained many archaic Korean words; thus, a new expert
translation effort began in 2012. Since then, the records of only one king have
been completed in a decade. In parallel, expert translators are working on an
English translation, also at a slow pace, and have produced only one king's
records in English so far. Thus, we propose H2KE, a neural machine translation
model that translates historical documents in Hanja into more easily
understandable Korean and into English. Built on top of multilingual neural
machine translation, H2KE learns to translate a historical document written in
Hanja from both a full dataset of outdated Korean translations and a small
dataset of more recently translated contemporary Korean and English. We compare
our method against two baselines: a recent model that simultaneously learns to
restore and translate Hanja historical documents, and a Transformer-based model
trained only on newly translated corpora. The experiments reveal that our method
significantly outperforms the baselines in terms of BLEU scores for both
contemporary Korean and English translations. We further conduct an extensive
human evaluation, which shows that our translation is preferred over the
original expert translations by both experts and non-expert Korean speakers.
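The abstract says H2KE builds on multilingual neural machine translation and learns from a large outdated-Korean corpus together with small contemporary-Korean and English corpora. One common way to set up such training, sketched below, is to prefix each source sentence with a token naming the desired target and to mix all corpora into a single training set. This is only an illustration under stated assumptions: the tag strings, the function names, and the `upsample` parameter are hypothetical and are not taken from the paper, and the sentence pairs are placeholders.

```python
import random

# Hypothetical target tags; the abstract does not say how H2KE marks the target.
TAGS = {"old_ko": "<2ok>", "ko": "<2ko>", "en": "<2en>"}


def tag_pairs(pairs, target):
    """Prefix each Hanja source sentence with a token naming the target."""
    return [(f"{TAGS[target]} {src}", tgt) for src, tgt in pairs]


def build_training_set(old_ko, new_ko, new_en, upsample=1, seed=0):
    """Mix the large and small corpora into one list of tagged pairs.

    `upsample` repeats the two small corpora so they are not drowned out
    by the large one (an assumed heuristic, not a detail from the paper).
    """
    data = tag_pairs(old_ko, "old_ko")
    data += tag_pairs(new_ko, "ko") * upsample
    data += tag_pairs(new_en, "en") * upsample
    random.Random(seed).shuffle(data)  # deterministic shuffle for repeatability
    return data


if __name__ == "__main__":
    # Placeholder strings stand in for real Hanja sentences and translations.
    old = [("hanja_1", "old_korean_1"), ("hanja_2", "old_korean_2")]
    new_ko = [("hanja_1", "contemporary_korean_1")]
    new_en = [("hanja_1", "english_1")]
    for src, tgt in build_training_set(old, new_ko, new_en, upsample=2):
        print(src, "->", tgt)
```

With a single model trained on the mixed set, the same Hanja sentence can be decoded into any of the three targets by changing the tag, which is how the small corpora can benefit from the large one.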
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Punctuation restoration Model and Spacing Model for Korean Ancient Document [0.5524804393257919]
In Korean ancient documents, there is no spacing or punctuation, and they are written in classical Chinese characters.
While China has models predicting punctuation and spacing, applying them directly to Korean texts is problematic due to data differences.
We developed the first models which predict punctuation and spacing for Korean historical texts and evaluated their performance.
arXiv Detail & Related papers (2023-12-19T06:15:52Z)
- HistRED: A Historical Document-Level Relation Extraction Dataset [32.96963890713529]
HistRED is constructed from Yeonhaengnok, a collection of records originally written in Hanja, the classical Chinese writing.
HistRED provides bilingual annotations such that RE can be performed on Korean and Hanja texts.
We propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities.
arXiv Detail & Related papers (2023-07-10T00:24:27Z)
- Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models [17.749113496737106]
We construct the first Classical-Chinese-to-Kanbun dataset in the world.
Character reordering and machine translation play a significant role in Kanbun comprehension.
We release our code and dataset on GitHub.
arXiv Detail & Related papers (2023-05-22T06:30:02Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models with continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z)
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and an endangered language, Cherokee.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- A Multilingual Neural Machine Translation Model for Biomedical Data [84.17747489525794]
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain.
The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English.
It is trained with large amounts of generic and biomedical data, using domain tags.
arXiv Detail & Related papers (2020-08-06T21:26:43Z)