HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja
- URL: http://arxiv.org/abs/2501.11951v1
- Date: Tue, 21 Jan 2025 07:49:51 GMT
- Title: HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja
- Authors: Seyoung Song, Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh,
- Abstract summary: HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding.
HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean.
- Score: 48.07219104902607
- License:
- Abstract: While Korean historical documents are invaluable cultural heritage, understanding those documents requires in-depth Hanja expertise. Hanja is an ancient language used in Korea before the 20th century, whose characters were borrowed from old Chinese but had evolved in Korea for centuries. Modern Koreans and Chinese cannot understand Korean historical documents without substantial additional help, and while previous efforts have produced some Korean and English translations, this requires in-depth expertise, and so most of the documents are not translated into any modern language. To address this gap, we present HERITAGE, the first open-source Hanja NLP toolkit to assist in understanding and translating the unexplored Korean historical documents written in Hanja. HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding via Hanja language models: punctuation restoration, named entity recognition, and machine translation (MT). HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean, as well as character-level English definition. HERITAGE serves two purposes. First, anyone interested in these documents can get a general understanding from the model predictions and the interactive glossary, especially MT outputs in Korean and English. Second, since the model outputs are not perfect, Hanja experts can revise them to produce better annotations and translations. This would boost the translation efficiency and potentially lead to most of the historical documents being translated into modern languages, lowering the barrier on unexplored Korean historical documents.
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z) - HyperCLOVA X Technical Report [119.94633129762133]
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture.
HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets.
The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English.
arXiv Detail & Related papers (2024-04-02T13:48:49Z) - HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models [0.0]
We introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth.
The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension.
arXiv Detail & Related papers (2023-09-06T04:38:16Z) - HistRED: A Historical Document-Level Relation Extraction Dataset [32.96963890713529]
HistRED is constructed from Yeonhaengnok, a collection of records originally written in Hanja, the classical Chinese writing.
HistRED provides bilingual annotations such that RE can be performed on Korean and Hanja texts.
We propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities.
arXiv Detail & Related papers (2023-07-10T00:24:27Z) - HUE: Pretrained Model and Dataset for Understanding Hanja Documents of
Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z) - Translating Hanja Historical Documents to Contemporary Korean and
English [52.625998002213585]
Annals of Joseon Dynasty contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea.
The Annals were originally written in an archaic Korean writing system, Hanja', and were translated into Korean from 1968 to 1993.
Since then, the records of only one king have been completed in a decade.
We propose H2KE, a neural machine translation model, that translates historical documents in Hanja to more easily understandable Korean and to English.
arXiv Detail & Related papers (2022-05-20T08:25:11Z) - ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality
Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and an endangered language Cherokee.
It supports both statistical and neural translation models as well as provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.