CHisIEC: An Information Extraction Corpus for Ancient Chinese History
- URL: http://arxiv.org/abs/2403.15088v2
- Date: Sat, 20 Apr 2024 08:46:12 GMT
- Title: CHisIEC: An Information Extraction Corpus for Ancient Chinese History
- Authors: Xuemei Tang, Zekun Deng, Qi Su, Hao Yang, Jun Wang
- Abstract summary: We present the "Chinese Historical Information Extraction Corpus" (CHisIEC) dataset.
CHisIEC is a meticulously curated dataset designed for developing and evaluating NER and RE tasks.
The dataset encompasses four distinct entity types and twelve relation types, comprising 14,194 labeled entities and 8,609 labeled relations.
- Score: 12.41912979618724
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Natural Language Processing (NLP) plays a pivotal role in the realm of Digital Humanities (DH) and serves as the cornerstone for advancing the structural analysis of historical and cultural heritage texts. This is particularly true for the domains of named entity recognition (NER) and relation extraction (RE). In our commitment to expediting research on ancient history and culture, we present the "Chinese Historical Information Extraction Corpus" (CHisIEC). CHisIEC is a meticulously curated dataset designed for developing and evaluating NER and RE tasks, offering a resource to facilitate research in the field. Covering 13 dynasties and a span of over 1,830 years, CHisIEC epitomizes the extensive temporal range and text heterogeneity inherent in Chinese historical documents. The dataset encompasses four distinct entity types and twelve relation types, resulting in a meticulously labeled dataset comprising 14,194 entities and 8,609 relations. To establish the robustness and versatility of our dataset, we have undertaken comprehensive experimentation involving models of various sizes and paradigms. Additionally, we have evaluated the capabilities of Large Language Models (LLMs) on tasks related to ancient Chinese history. The dataset and code are available at https://github.com/tangxuemei1995/CHisIEC.
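The corpus is distributed via the GitHub repository above. As a quick sanity check on a local copy, the sketch below tallies entity and relation types from an annotation file. It is a minimal sketch only: the JSON-lines layout, the field names (`entities`, `relations`, `type`), and the file name `chisiec_train.jsonl` are illustrative assumptions, not the repository's documented schema; consult the repository for the actual format.

```python
# Minimal sketch for inspecting a CHisIEC-style annotation file.
# Assumptions (not confirmed by the paper): each line is a JSON document with
# "entities" (each carrying a "type") and "relations" (each carrying a "type").
import json
from collections import Counter
from pathlib import Path


def load_documents(path: str) -> list[dict]:
    """Load a JSON-lines file where each non-empty line is one annotated document."""
    docs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                docs.append(json.loads(line))
    return docs


def summarize(docs: list[dict]) -> None:
    """Print entity- and relation-type frequencies across the loaded documents."""
    entity_types = Counter()
    relation_types = Counter()
    for doc in docs:
        entity_types.update(e["type"] for e in doc.get("entities", []))
        relation_types.update(r["type"] for r in doc.get("relations", []))
    print(f"{sum(entity_types.values())} entities across {len(entity_types)} types")
    print(f"{sum(relation_types.values())} relations across {len(relation_types)} types")
    for name, count in entity_types.most_common():
        print(f"  entity   {name}: {count}")
    for name, count in relation_types.most_common():
        print(f"  relation {name}: {count}")


if __name__ == "__main__":
    # Hypothetical file name; replace with the actual split file from the repo.
    train_path = Path("chisiec_train.jsonl")
    if train_path.exists():
        summarize(load_documents(str(train_path)))
```

When run over all splits, the printed totals can be compared against the 14,194 entities and 8,609 relations reported in the abstract.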
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- Unlocking Comics: The AI4VA Dataset for Visual Understanding [62.345344799258804]
This paper presents a novel dataset comprising Franco-Belgian comics from the 1950s annotated for tasks including depth estimation, semantic segmentation, saliency detection, and character identification.
It consists of two distinct and consistent styles and incorporates object concepts and labels taken from natural images.
By including such diverse information across styles, this dataset not only holds promise for computational creativity but also offers avenues for the digitization of art and storytelling innovation.
arXiv Detail & Related papers (2024-10-27T14:27:05Z)
- A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts [8.405938712823563]
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era.
arXiv Detail & Related papers (2024-07-21T12:14:45Z)
- Contrastive Entity Coreference and Disambiguation for Historical Texts [2.446672595462589]
Existing entity disambiguation methods often fall short in accuracy for historical documents, which are replete with individuals not remembered in contemporary knowledgebases.
This study makes three key contributions to improve cross-document coreference resolution and disambiguation in historical texts.
arXiv Detail & Related papers (2024-06-21T18:22:14Z)
- Surveying the Dead Minds: Historical-Psychological Text Analysis with Contextualized Construct Representation (CCR) for Classical Chinese [4.772998830872483]
We develop a pipeline for historical-psychological text analysis in classical Chinese.
The pipeline combines expert knowledge in psychometrics with text representations generated via transformer-based language models.
Considering the scarcity of available data, we propose an indirect supervised contrastive learning approach.
arXiv Detail & Related papers (2024-03-01T13:14:45Z)
- LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset specifically tailored to enhance the training of language models for long-text tasks with over 160B tokens.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
arXiv Detail & Related papers (2024-02-21T07:27:18Z)
- SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages.
These languages originate from five distinct language families and are predominantly spoken in Africa and Asia.
Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z)
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
arXiv Detail & Related papers (2023-06-28T14:14:44Z)
- ScrollTimes: Tracing the Provenance of Paintings as a Window into History [35.605930297790465]
The study of cultural artifact provenance, tracing ownership and preservation, holds significant importance in archaeology and art history.
In collaboration with art historians, we examined the handscroll, a traditional Chinese painting form that provides a rich source of historical data.
We present a three-tiered methodology encompassing artifact, contextual, and provenance levels, designed to create a "Biography" for a handscroll.
arXiv Detail & Related papers (2023-06-15T03:38:09Z)
- HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea [59.35609710776603]
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models with continued training on two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z)
- Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims [54.93898455714295]
We first describe the construction of the novel PANACEA dataset consisting of heterogeneous claims on COVID-19.
We then propose novel techniques for automated veracity assessment based on Natural Language Inference.
arXiv Detail & Related papers (2022-05-05T12:11:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.