Restoring and Mining the Records of the Joseon Dynasty via Neural
Language Modeling and Machine Translation
- URL: http://arxiv.org/abs/2104.05964v2
- Date: Wed, 14 Apr 2021 06:18:25 GMT
- Title: Restoring and Mining the Records of the Joseon Dynasty via Neural
Language Modeling and Machine Translation
- Authors: Kyeongpil Kang, Kyohoon Jin, Soyoung Yang, Sujin Jang, Jaegul Choo,
Youngbin Kim
- Abstract summary: We present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism.
Our approach significantly improves translation accuracy over baselines without multi-task learning.
- Score: 20.497110880878544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding voluminous historical records provides clues on the past in
various aspects, such as social and political issues and even natural science
facts. However, it is generally difficult to fully utilize the historical
records, since most of the documents are not written in a modern language and
parts of the contents have been damaged over time. As a result, restoring the damaged
or unrecognizable parts and translating the records into modern
languages are crucial tasks. In response, we present a multi-task learning
approach to restore and translate historical documents based on a
self-attention mechanism, specifically utilizing two Korean historical records
that are among the most voluminous historical records in the world. Experimental
results show that our approach significantly improves the accuracy of the
translation task over baselines without multi-task learning. In addition, we
present an in-depth exploratory analysis of our translated results via topic
modeling, uncovering several significant historical events.
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions (OBI) are among the oldest existing forms of writing in the world.
A large number of OBI remain undeciphered, making their decipherment one of the global challenges in paleography today.
This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
- PHD: Pixel-Based Language Modeling of Historical Documents [55.75201940642297]
We propose a novel method for generating synthetic scans to resemble real historical documents.
We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.
We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z)
- Transfer Learning across Several Centuries: Machine and Historian
Integrated Method to Decipher Royal Secretary's Diary [1.105375732595832]
NER in historical text has faced challenges such as the scarcity of annotated corpora, multilingual variety, various kinds of noise, and conventions far removed from those of contemporary language models.
This paper introduces a Korean historical corpus (the Diary of the Royal Secretary, named SeungJeongWon) that was recorded over several centuries and recently augmented with named entity information and phrase markers carefully annotated by historians.
arXiv Detail & Related papers (2023-06-26T11:00:35Z)
- Multilingual Event Extraction from Historical Newspaper Adverts [42.987470570997694]
This paper focuses on the under-explored task of event extraction from a novel domain of historical texts.
We introduce a new multilingual dataset in English, French, and Dutch composed of newspaper ads from the early modern colonial period.
We find that even with scarce annotated data, it is possible to achieve surprisingly good results by formulating the problem as an extractive QA task.
arXiv Detail & Related papers (2023-05-18T12:40:41Z)
- HUE: Pretrained Model and Dataset for Understanding Hanja Documents of
Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models further trained on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z)
- Placing (Historical) Facts on a Timeline: A Classification cum Coref
Resolution Approach [4.809236881780707]
A timeline provides one of the most effective ways to visualize the important historical facts that occurred over a period of time.
We introduce a two-stage system for event timeline generation from multiple (historical) text documents.
Our results can be extremely helpful for historians, in advancing research in history and in understanding the socio-political landscape of a country.
arXiv Detail & Related papers (2022-06-28T15:36:44Z)
- Summarising Historical Text in Modern Languages [13.886432536330805]
We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language.
This is a fundamentally important routine to historians and digital humanities researchers but has never been automated.
We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese.
arXiv Detail & Related papers (2021-01-26T13:00:07Z)
- Ranking Enhanced Dialogue Generation [77.8321855074999]
How to effectively utilize the dialogue history is a crucial problem in multi-turn dialogue generation.
Previous works usually employ various neural network architectures to model the history.
This paper proposes a Ranking Enhanced Dialogue generation framework.
arXiv Detail & Related papers (2020-08-13T01:49:56Z)
- Combining Visual and Textual Features for Semantic Segmentation of
Historical Newspapers [2.5899040911480187]
We introduce a multimodal approach for the semantic segmentation of historical newspapers.
Based on experiments on diachronic Swiss and Luxembourgish newspapers, we investigate the predictive power of visual and textual features.
Results show consistent improvement of multimodal models in comparison to a strong visual baseline.
arXiv Detail & Related papers (2020-02-14T17:56:18Z)