Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
- URL: http://arxiv.org/abs/2510.24541v1
- Date: Tue, 28 Oct 2025 15:43:26 GMT
- Title: Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
- Authors: Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyunghyun Cho, Alice Oh
- Abstract summary: We introduce the Open Korean Historical Corpus, a dataset spanning 1,300 years and 6 languages. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language.
- Score: 52.754009498236684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
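The Hanja-to-Hangul transition described in the abstract can be illustrated with a simple per-text script ratio. The following is a minimal sketch, not the authors' method: the function names `script_of` and `hangul_share` are hypothetical, and using Unicode block ranges as a proxy for "Hangul" versus "Hanja" is an assumption made here for illustration.

```python
from collections import Counter


def script_of(ch: str) -> str:
    """Classify one character by Unicode block (an assumed proxy for script)."""
    cp = ord(ch)
    # Hangul Syllables, Hangul Jamo, Hangul Compatibility Jamo
    if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F:
        return "hangul"
    # CJK Unified Ideographs, Extension A, CJK Compatibility Ideographs
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or 0xF900 <= cp <= 0xFAFF:
        return "hanja"
    return "other"


def hangul_share(text: str):
    """Return Hangul / (Hangul + Hanja) character share, or None if neither occurs."""
    counts = Counter(script_of(ch) for ch in text)
    total = counts["hangul"] + counts["hanja"]
    return counts["hangul"] / total if total else None


# A mixed-script phrase: three Hanja characters and two Hangul syllables.
print(hangul_share("國漢文 혼용"))
```

Averaging such a ratio over documents grouped by decade would give one rough view of a script shift; the corpus's actual analysis may define and measure the transition differently.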
Related papers
- Targum -- A Multilingual New Testament Translation Corpus [46.390064640459]
We introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs.
arXiv Detail & Related papers (2026-02-10T12:27:57Z)
- HERITAGE: An End-to-End Web Platform for Processing Korean Historical Documents in Hanja [48.07219104902607]
HERITAGE is a web-based platform providing model predictions of three critical tasks in historical document understanding. HERITAGE also provides an interactive glossary, which provides the character-level reading of the Hanja characters in modern Korean.
arXiv Detail & Related papers (2025-01-21T07:49:51Z)
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Efficient and Effective Vocabulary Expansion Towards Multilingual Large Language Models [9.359647125218359]
This report introduces EEVE-Korean-v1.0, a Korean adaptation of large language models.
Our method can significantly boost non-English proficiency within just 2 billion tokens.
arXiv Detail & Related papers (2024-02-22T17:12:39Z)
- HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models with continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z)
- Translating Hanja Historical Documents to Contemporary Korean and English [52.625998002213585]
The Annals of the Joseon Dynasty contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea.
The Annals were originally written in an archaic Korean writing system, Hanja, and were translated into Korean from 1968 to 1993.
Since then, the records of only one king have been completed in a decade.
We propose H2KE, a neural machine translation model, that translates historical documents in Hanja to more easily understandable Korean and to English.
arXiv Detail & Related papers (2022-05-20T08:25:11Z)
- Corpus of Chinese Dynastic Histories: Gender Analysis over Two Millennia [3.2851864672627618]
Dynastic histories form a large continuous linguistic space of approximately 2000 years, from the 3rd century BCE to the 18th century CE.
The histories are documented in Classical (Literary) Chinese in a corpus of over 20 million characters, suitable for the computational analysis of historical lexicon and semantic change.
This project introduces a new open-source corpus of twenty-four dynastic histories covered by a Creative Commons license.
arXiv Detail & Related papers (2020-05-18T15:14:33Z)