KoCHET: a Korean Cultural Heritage corpus for Entity-related Tasks
- URL: http://arxiv.org/abs/2209.00367v2
- Date: Fri, 2 Sep 2022 05:53:58 GMT
- Title: KoCHET: a Korean Cultural Heritage corpus for Entity-related Tasks
- Authors: Gyeongmin Kim, Jinsung Kim, Junyoung Son, Heuiseok Lim
- Abstract summary: KoCHET is a Korean cultural heritage corpus for typical entity-related tasks.
It consists of 112,362, 38,765, and 113,198 examples for the NER, RE, and ET tasks, respectively.
Unlike existing public corpora, modified redistribution is permitted for both domestic and foreign researchers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As digitized traditional cultural heritage documents have rapidly increased,
resulting in an increased need for preservation and management, practical
recognition of entities and typification of their classes has become essential.
To achieve this, we propose KoCHET - a Korean cultural heritage corpus for the
typical entity-related tasks, i.e., named entity recognition (NER), relation
extraction (RE), and entity typing (ET). Advised by cultural heritage experts
based on the data construction guidelines of government-affiliated
organizations, KoCHET consists of 112,362, 38,765, and 113,198 examples for the
NER, RE, and ET tasks, respectively, covering all entity types related to Korean
cultural heritage. Moreover, unlike existing public corpora, KoCHET permits
modified redistribution to both domestic and foreign researchers. Our
experimental results demonstrate the practical usability of KoCHET for the
cultural heritage domain, and we provide further insights through statistical
and linguistic analysis. Our corpus is freely available at
https://github.com/Gyeongmin47/KoCHET.
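The three task formats can be illustrated with hypothetical records; the field names, label set, and example sentence below are assumptions for illustration only, not KoCHET's actual schema:

```python
# Hypothetical example records for the three entity-related tasks.
# Field names and labels are illustrative assumptions, not the real corpus schema.

# NER: token-level BIO tags marking entity spans in a sentence.
ner_example = {
    "tokens": ["Sungnyemun", "is", "in", "Seoul", "."],
    "tags":   ["B-ARTIFACT", "O", "O", "B-LOCATION", "O"],
}

# RE: a sentence with two marked entities and the relation between them.
re_example = {
    "sentence": "Sungnyemun is in Seoul.",
    "head": "Sungnyemun",
    "tail": "Seoul",
    "relation": "located_in",
}

# ET: a single entity mention assigned a fine-grained type.
et_example = {
    "mention": "Sungnyemun",
    "context": "Sungnyemun is in Seoul.",
    "type": "ARTIFACT",
}

def entity_spans(tokens, tags):
    """Recover (entity_text, type) pairs from BIO-tagged tokens."""
    spans, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans

print(entity_spans(ner_example["tokens"], ner_example["tags"]))
# → [('Sungnyemun', 'ARTIFACT'), ('Seoul', 'LOCATION')]
```

The same sentence can feed all three tasks: NER locates the spans, RE labels the relation between a pair of them, and ET assigns a type to a single mention in context.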
Related papers
- How Well Do LLMs Identify Cultural Unity in Diversity?
We introduce a benchmark dataset for evaluating decoder-only large language models (LLMs) in understanding the cultural unity of concepts.
CUNIT consists of 1,425 evaluation examples building upon 285 traditional cultural-specific concepts across 10 countries.
We design a contrastive matching task to evaluate the LLMs' capability to identify highly associated cross-cultural concept pairs.
arXiv Detail & Related papers (2024-08-09T14:45:22Z)
- Extrinsic Evaluation of Cultural Competence in Large Language Models
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
- CulturePark: Boosting Cross-cultural Understanding in Large Language Models
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
- CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies
CultureBank is a knowledge base built upon users' self-narratives.
It contains 12K cultural descriptors sourced from TikTok and 11K from Reddit.
We offer recommendations for future culturally aware language technologies.
arXiv Detail & Related papers (2024-04-23T17:16:08Z)
- CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean
We introduce CLIcK, a benchmark dataset of Cultural and Linguistic Intelligence in Korean comprising 1,995 QA pairs.
CLIcK sources its data from official Korean exams and textbooks, partitioning the questions into eleven categories under the two main categories of language and culture.
Using CLIcK, we test 13 language models to assess their performance. Our evaluation uncovers insights into their performances across the categories, as well as the diverse factors affecting their comprehension.
arXiv Detail & Related papers (2024-03-11T03:54:33Z)
- Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
- HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models
We introduce the HAE-RAE Bench, a dataset curated to challenge models lacking Korean cultural and contextual depth.
The dataset encompasses six downstream tasks across four domains: vocabulary, history, general knowledge, and reading comprehension.
arXiv Detail & Related papers (2023-09-06T04:38:16Z)
- HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea
We release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks.
We also present BERT-based models continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats.
arXiv Detail & Related papers (2022-10-11T03:04:28Z)
- Geolocation of Cultural Heritage using Multi-View Knowledge Graph Embedding
We present a framework for ingesting knowledge about tangible cultural heritage entities.
We also propose a learning model for estimating the relative distance between a pair of cultural heritage entities.
arXiv Detail & Related papers (2022-09-08T08:32:34Z)
- Entity Graph Extraction from Legal Acts -- a Prototype for a Use Case in Policy Design Analysis
This paper presents a prototype developed to serve the quantitative study of public policy design.
Our system aims to automate the process of gathering legal documents, annotating them with Institutional Grammar, and using hypergraphs to analyse inter-relations between crucial entities.
arXiv Detail & Related papers (2022-09-02T10:57:47Z)
- WHOSe Heritage: Classification of UNESCO World Heritage "Outstanding Universal Value" Documents with Smoothed Labels
This study applies state-of-the-art NLP models to build a classifier on a new real-world dataset containing official OUV justification statements.
Label smoothing is innovatively adapted to transform the task smoothly between multi-class and multi-label classification.
The study shows that the best models fine-tuned from BERT and ULMFiT can reach 94.3% top-3 accuracy.
arXiv Detail & Related papers (2021-04-12T15:18:41Z)