HUE: Pretrained Model and Dataset for Understanding Hanja Documents of
Ancient Korea
- URL: http://arxiv.org/abs/2210.05112v1
- Date: Tue, 11 Oct 2022 03:04:28 GMT
- Authors: Haneul Yoo, Jiho Jin, Juhee Son, JinYeong Bak, Kyunghyun Cho, Alice Oh
- Abstract summary: We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models with continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats.
- Score: 59.35609710776603
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Historical records in Korea before the 20th century were primarily written in
Hanja, an extinct language based on Chinese characters and not understood by
modern Korean or Chinese speakers. Historians with expertise in this time
period have been analyzing the documents, but that process is very difficult
and time-consuming, and language models would significantly speed up the
process. Toward building and evaluating language models for Hanja, we release
the Hanja Understanding Evaluation dataset consisting of chronological
attribution, topic classification, named entity recognition, and summary
retrieval tasks. We also present BERT-based models with continued training on
the two major corpora from the 14th to the 19th centuries: the Annals of the
Joseon Dynasty and the Diaries of the Royal Secretariats. We compare the models with
several baselines on all tasks and show there are significant improvements
gained by training on the two corpora. Additionally, we run zero-shot
experiments on the Daily Records of the Royal Court and Important Officials
(DRRI). The DRRI dataset has not been studied much by historians, and not at
all by the NLP community.
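Concretely, the continued training the abstract describes is domain-adaptive masked-language-model pretraining of a BERT checkpoint on the Hanja corpora. Below is a minimal sketch of that recipe using HuggingFace Transformers; the multilingual checkpoint, the corpus file name, and the hyperparameters are illustrative assumptions, not the paper's released configuration.

# Minimal sketch: continue masked-language-model pretraining of a BERT
# checkpoint on a Hanja corpus. "bert-base-multilingual-cased" and
# "hanja_corpus.txt" are placeholder assumptions for illustration.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-multilingual-cased"  # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One document (e.g., one Annals entry) per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "hanja_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style dynamic masking: 15% of tokens are masked each step.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hanja-bert", num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # continued pretraining on the Hanja corpus

The resulting checkpoint would then be fine-tuned separately for each HUE task, e.g., as a sequence classifier for chronological attribution and topic classification, or as a token classifier for named entity recognition.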
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- HistRED: A Historical Document-Level Relation Extraction Dataset [32.96963890713529]
HistRED is constructed from Yeonhaengnok, a collection of records originally written in Hanja, the classical Chinese writing system.
HistRED provides bilingual annotations such that RE can be performed on Korean and Hanja texts.
We propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities.
arXiv Detail & Related papers (2023-07-10T00:24:27Z)
- Transfer Learning across Several Centuries: Machine and Historian Integrated Method to Decipher Royal Secretary's Diary [1.105375732595832]
NER in historical texts faces challenges such as the scarcity of annotated corpora, multilingual variety, various kinds of noise, and conventions far removed from those of contemporary language models.
This paper introduces a Korean historical corpus (the Royal Secretary's diary, named SeungJeongWon), recorded over several centuries and recently annotated by historians with named entity information as well as phrase markers.
arXiv Detail & Related papers (2023-06-26T11:00:35Z)
- Translating Hanja Historical Documents to Contemporary Korean and English [52.625998002213585]
The Annals of the Joseon Dynasty contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea.
The Annals were originally written in an archaic Korean writing system, Hanja, and were translated into Korean from 1968 to 1993.
In the decade since, the records of only one king have been completed.
We propose H2KE, a neural machine translation model that translates historical documents in Hanja into more easily understandable Korean and into English.
arXiv Detail & Related papers (2022-05-20T08:25:11Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
- Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation [20.497110880878544]
We present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism.
Our approach significantly improves the accuracy of the translation task compared to baselines trained without multi-task learning.
arXiv Detail & Related papers (2021-04-13T06:40:25Z)
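The multi-task setup this last paper describes, restoration and translation sharing one self-attention encoder, can be illustrated with a short PyTorch sketch: a token-level head fills in damaged Hanja characters while a decoder head produces the Korean translation, and the two cross-entropy losses are summed. All sizes, vocabularies, and the equal loss weighting are assumptions for illustration (positional encodings are omitted for brevity); this is not the authors' exact architecture.

import torch
import torch.nn as nn

class HanjaMultiTask(nn.Module):
    """Shared self-attention encoder with a restoration head and a translation decoder."""
    def __init__(self, src_vocab, tgt_vocab, d_model=256, nhead=4, layers=3):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.restore_head = nn.Linear(d_model, src_vocab)    # damaged-slot filling
        self.translate_head = nn.Linear(d_model, tgt_vocab)  # next Korean token

    def forward(self, src_ids, tgt_ids):
        memory = self.encoder(self.src_embed(src_ids))
        restore_logits = self.restore_head(memory)
        # Causal mask so the translation decoder cannot peek ahead.
        seq_len = tgt_ids.size(1)
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        dec_out = self.decoder(self.tgt_embed(tgt_ids), memory, tgt_mask=causal)
        return restore_logits, self.translate_head(dec_out)

# Joint training step on a toy batch: sum the two cross-entropy losses.
model = HanjaMultiTask(src_vocab=8000, tgt_vocab=16000)
loss_fn = nn.CrossEntropyLoss()
src = torch.randint(0, 8000, (2, 40))         # damaged Hanja input
gold_chars = torch.randint(0, 8000, (2, 40))  # gold characters per position
tgt_in = torch.randint(0, 16000, (2, 50))     # Korean tokens, shifted right
tgt_out = torch.randint(0, 16000, (2, 50))    # Korean tokens, shifted left

r_logits, t_logits = model(src, tgt_in)
loss = (loss_fn(r_logits.transpose(1, 2), gold_chars)
        + loss_fn(t_logits.transpose(1, 2), tgt_out))
loss.backward()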