HistRED: A Historical Document-Level Relation Extraction Dataset
- URL: http://arxiv.org/abs/2307.04285v1
- Date: Mon, 10 Jul 2023 00:24:27 GMT
- Title: HistRED: A Historical Document-Level Relation Extraction Dataset
- Authors: Soyoung Yang, Minseok Choi, Youngwoo Cho, Jaegul Choo
- Abstract summary: HistRED is constructed from Yeonhaengnok, a collection of records originally written in Hanja, the classical Chinese writing system.
HistRED provides bilingual annotations such that RE can be performed on Korean and Hanja texts.
We propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the extensive applications of relation extraction (RE) tasks in
various domains, little has been explored in the historical domain, which
contains promising data spanning hundreds to thousands of years. To promote
historical RE research, we present HistRED, constructed from Yeonhaengnok.
Yeonhaengnok is a collection of records originally written in Hanja, the
classical Chinese writing system, which was later translated into Korean. HistRED
provides bilingual annotations such that RE can be performed on Korean and
Hanja texts. In addition, HistRED provides self-contained subtexts of
varying lengths, from the sentence level to the document level, supporting
diverse context settings for researchers to evaluate the robustness of their RE
models. To demonstrate the usefulness of our dataset, we propose a bilingual RE
model that leverages both Korean and Hanja contexts to predict relations
between entities. Our model outperforms monolingual baselines on HistRED,
showing that leveraging multiple language contexts strengthens RE
predictions. The dataset is publicly available at:
https://huggingface.co/datasets/Soyoung/HistRED under CC BY-NC-ND 4.0 license.
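The bilingual RE idea described in the abstract can be sketched in a minimal, self-contained form. Everything below is illustrative, not the authors' implementation: the relation labels, the hash-based `encode` stub (standing in for a trained Korean or Hanja encoder), and the untrained linear classifier are all assumptions. The sketch only shows the fusion scheme: encode the entity pair's context in each language, concatenate the two vectors, and score relation labels.

```python
import math
import random

random.seed(0)

DIM = 4  # toy embedding size
# Illustrative relation labels; HistRED's actual label set differs.
RELATIONS = ["position_held", "located_in", "no_relation"]

def encode(tokens, dim=DIM):
    """Stand-in for a monolingual encoder (e.g. a Korean or Hanja LM):
    deterministically hashes tokens into a fixed-size unit vector."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def predict_relation(korean_tokens, hanja_tokens, weights):
    """Concatenate the Korean and Hanja context vectors, then score
    relation labels with a linear layer (untrained here)."""
    fused = encode(korean_tokens) + encode(hanja_tokens)  # 2*DIM features
    scores = [sum(w * x for w, x in zip(row, fused)) for row in weights]
    return RELATIONS[max(range(len(scores)), key=scores.__getitem__)]

# Random, untrained weights: one row per relation label, 2*DIM columns.
W = [[random.uniform(-1, 1) for _ in range(2 * DIM)] for _ in RELATIONS]

print(predict_relation(["조선", "사신"], ["朝鮮", "使臣"], W))
```

In the paper's actual setting, the two encoders would be pretrained language models and the classifier would be trained on HistRED's bilingual annotations; the point of the sketch is only that both language contexts feed one joint prediction.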
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Multilingual Coreference Resolution in Low-resource South Asian Languages
We introduce a Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages.
Nearly all of the predicted translations successfully pass a sanity check, and 75% of English references align with their predicted translations.
This study is the first to evaluate an end-to-end coreference resolution model on a Hindi golden set.
arXiv Detail & Related papers (2024-02-21T07:05:51Z)
- Transfer Learning across Several Centuries: Machine and Historian Integrated Method to Decipher Royal Secretary's Diary
NER in historical texts faces challenges such as the scarcity of annotated corpora, multilingual variety, various kinds of noise, and conventions far removed from those of contemporary language.
This paper introduces a Korean historical corpus (the diary of the Royal Secretariat, named SeungJeongWon), recorded over several centuries and recently augmented with named entity information as well as phrase markers carefully annotated by historians.
arXiv Detail & Related papers (2023-06-26T11:00:35Z)
- Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval
Cross-lingual information retrieval is the task of searching documents in one language with queries in another.
We provide a conceptual framework for organizing different approaches to cross-lingual retrieval, using multi-stage architectures for monolingual retrieval as a scaffold.
We implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for test collections from the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese.
arXiv Detail & Related papers (2023-04-03T14:17:00Z)
- HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models further pretrained on two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariat.
arXiv Detail & Related papers (2022-10-11T03:04:28Z)
- Assessing Neural Referential Form Selectors on a Realistic Multilingual Dataset
We build a dataset based on the OntoNotes corpus that contains a broader range of referring expression (RE) use in both English and Chinese.
We build neural Referential Form Selection (RFS) models accordingly, assess them on the dataset and conduct probing experiments.
arXiv Detail & Related papers (2022-10-10T16:42:25Z)
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction
We propose a new dataset, DiS-ReX, which alleviates issues in existing multilingual DS-RE datasets.
Our dataset has more than 1.5 million sentences, spanning 4 languages with 36 relation classes plus 1 no-relation (NA) class.
We also modify the widely used bag attention models by encoding sentences using mBERT and provide the first benchmark results on multilingual DS-RE.
arXiv Detail & Related papers (2021-04-17T22:44:38Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.