KoCoNovel: Annotated Dataset of Character Coreference in Korean Novels
- URL: http://arxiv.org/abs/2404.01140v2
- Date: Thu, 11 Apr 2024 14:57:10 GMT
- Title: KoCoNovel: Annotated Dataset of Character Coreference in Korean Novels
- Authors: Kyuhee Kim, Surin Lee, Sangah Lee
- Abstract summary: KoCoNovel is a novel character coreference dataset derived from Korean literary texts.
One of KoCoNovel's distinctive features is that 24% of all character mentions are single common nouns.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present KoCoNovel, a novel character coreference dataset derived from Korean literary texts, complete with detailed annotation guidelines. Comprising 178K tokens from 50 modern and contemporary novels, KoCoNovel stands as one of the largest public coreference resolution corpora in Korean, and the first to be based on literary texts. KoCoNovel offers four distinct versions to accommodate a wide range of literary coreference analysis needs. These versions are designed to support perspectives of the omniscient author or readers, and to manage multiple entities as either separate or overlapping, thereby broadening its applicability. One of KoCoNovel's distinctive features is that 24% of all character mentions are single common nouns, lacking possessive markers or articles. This feature is particularly influenced by the nuances of Korean address term culture, which favors the use of terms denoting social relationships and kinship over personal names. In experiments with a BERT-based coreference model, we observe notable performance enhancements with KoCoNovel in character coreference tasks within literary texts, compared to a larger non-literary coreference dataset. Such findings underscore KoCoNovel's potential to significantly enhance coreference resolution models through the integration of Korean cultural and linguistic dynamics.
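The annotation scheme described in the abstract can be illustrated with a minimal sketch: character mentions are grouped into entity clusters, and a statistic such as the share of bare common-noun mentions (24% in KoCoNovel) can be computed over them. The cluster format, example mentions, and flag name below are hypothetical illustrations, not the actual KoCoNovel schema.

```python
# Minimal sketch of character-coreference annotations, using a
# hypothetical cluster format (NOT the actual KoCoNovel schema).
# Each cluster is one character entity; each mention records its
# surface form and whether it is a single common noun without a
# possessive marker or article (e.g. a kinship or address term
# used in place of a personal name).

from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    is_bare_common_noun: bool

clusters = {
    "CHAR_1": [  # e.g. a character referred to by name, address term, and pronoun
        Mention("영희", False),
        Mention("언니", True),   # "older sister", bare address term
        Mention("그녀", False),  # pronoun
    ],
    "CHAR_2": [
        Mention("철수", False),
        Mention("아버지", True),  # "father", bare kinship term
    ],
}

# Flatten all mentions and compute the bare common-noun share.
mentions = [m for ms in clusters.values() for m in ms]
share = sum(m.is_bare_common_noun for m in mentions) / len(mentions)
print(f"bare common-noun mentions: {share:.0%}")  # 40% for this toy example
```

The same counting, applied to the full corpus annotations, is what yields a figure like the 24% reported in the abstract.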
Related papers
- Co-DETECT: Collaborative Discovery of Edge Cases in Text Classification [89.62851347390959]
Co-DETECT (Collaborative Discovery of Edge cases in TExt ClassificaTion) is a novel mixed-initiative annotation framework.
It integrates human expertise with automatic annotation guided by large language models.
arXiv Detail & Related papers (2025-07-07T13:48:54Z)
- Enriching the Korean Learner Corpus with Multi-reference Annotations and Rubric-Based Scoring [2.824980053889876]
We enhance the KoLLA Korean learner corpus by adding grammatical error correction references.
We enrich the corpus with rubric-based scores aligned with guidelines from the Korean National Language Institute.
arXiv Detail & Related papers (2025-05-01T03:04:07Z)
- Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.038123093599815]
Our objective is to translate continuous sign language into spoken language text.
We incorporate additional contextual cues together with the signing video.
We show that our contextual approach significantly enhances the quality of the translations.
arXiv Detail & Related papers (2025-01-16T18:59:03Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Multilingual Coreference Resolution in Low-resource South Asian Languages [36.31301773167754]
We introduce a Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages.
Nearly all of the predicted translations successfully pass a sanity check, and 75% of English references align with their predicted translations.
This study is the first to evaluate an end-to-end coreference resolution model on a Hindi golden set.
arXiv Detail & Related papers (2024-02-21T07:05:51Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- HistRED: A Historical Document-Level Relation Extraction Dataset [32.96963890713529]
HistRED is constructed from Yeonhaengnok, a collection of records originally written in Hanja, the classical Chinese writing.
HistRED provides bilingual annotations such that RE can be performed on Korean and Hanja texts.
We propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities.
arXiv Detail & Related papers (2023-07-10T00:24:27Z)
- Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z)
- SenteCon: Leveraging Lexicons to Learn Human-Interpretable Language Representations [51.08119762844217]
SenteCon is a method for introducing human interpretability in deep language representations.
We show that SenteCon provides high-level interpretability at little to no cost to predictive performance on downstream tasks.
arXiv Detail & Related papers (2023-05-24T05:06:28Z)
- BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal Reference Annotations [0.0]
We introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains.
This relatively small dataset contains 5200 mention annotations forming 502 mention clusters within 48,569 tokens.
arXiv Detail & Related papers (2023-04-07T15:08:46Z)
- An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
- RuCoCo: a new Russian corpus with coreference annotation [69.3939291118954]
We present a new corpus with coreference annotation, the Russian Coreference Corpus (RuCoCo).
RuCoCo contains news texts in Russian, part of which were annotated from scratch, and for the rest the machine-generated annotations were refined by human annotators.
The size of our corpus is one million words and around 150,000 mentions.
arXiv Detail & Related papers (2022-06-10T07:50:09Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- The Annotation Guideline of LST20 Corpus [0.3161954199291541]
The dataset follows the CoNLL-2003-style format for ease of use.
In total, it consists of 3,164,864 words, 288,020 named entities, 248,962 clauses, and 74,180 sentences.
All 3,745 documents are also annotated with 15 news genres.
arXiv Detail & Related papers (2020-08-12T01:16:45Z)
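The CoNLL-2003-style layout mentioned for the LST20 corpus above can be sketched as a simple reader: one token per line with column-separated annotations, and blank lines separating sentences. The column layout here (token, POS, NE tag) and the sample tokens are illustrative assumptions, not the exact LST20 schema.

```python
# Minimal sketch of reading a CoNLL-2003-style file: each non-blank
# line holds one token with tab-separated columns, and a blank line
# ends the current sentence. Column layout is illustrative only.

def read_conll(lines):
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(tuple(line.split("\t")))
    if current:  # flush a trailing sentence with no final blank line
        sentences.append(current)
    return sentences

# Toy two-sentence sample with hypothetical token/POS/NE columns.
sample = [
    "วันนี้\tNN\tO",
    "กรุงเทพ\tNN\tB_LOC",
    "",
    "ฝนตก\tVV\tO",
]
parsed = read_conll(sample)
print(len(parsed))  # 2 sentences
```

Grouping sentences this way before model training is the usual reason listings such as the LST20 guideline advertise CoNLL-style compliance: any standard sequence-labeling pipeline can consume the file unchanged.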
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.