Different Tastes of Entities: Investigating Human Label Variation in
Named Entity Annotations
- URL: http://arxiv.org/abs/2402.01423v1
- Date: Fri, 2 Feb 2024 14:08:34 GMT
- Authors: Siyao Peng, Zihang Sun, Sebastian Loftus, Barbara Plank
- Abstract summary: This paper studies disagreements in expert-annotated named entity datasets for three languages: English, Danish, and Bavarian.
We show that text ambiguity and artificial guideline changes are dominant factors for diverse annotations among high-quality revisions.
- Score: 23.059491714512077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Named Entity Recognition (NER) is a key information extraction task with a
long-standing tradition. While recent studies address and aim to correct
annotation errors via re-labeling efforts, little is known about the sources of
human label variation, such as text ambiguity, annotation error, or guideline
divergence. This is especially the case for high-quality datasets and beyond
English CoNLL03. This paper studies disagreements in expert-annotated named
entity datasets for three languages: English, Danish, and Bavarian. We show
that text ambiguity and artificial guideline changes are dominant factors for
diverse annotations among high-quality revisions. We survey student annotations
on a subset of difficult entities and substantiate the feasibility and
necessity of manifold annotations for understanding named entity ambiguities
from a distributional perspective.
Related papers
- We're Afraid Language Models Aren't Modeling Ambiguity [136.8068419824318]
Managing ambiguity is a key part of human language understanding.
We characterize ambiguity in a sentence by its effect on entailment relations with another sentence.
We show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity.
arXiv Detail & Related papers (2023-04-27T17:57:58Z)
- Author Name Disambiguation via Heterogeneous Network Embedding from Structural and Semantic Perspectives [13.266320447769564]
Name ambiguity is common in academic digital libraries, such as multiple authors having the same name.
The proposed method is mainly based on representation learning for heterogeneous networks and clustering.
The semantic representation is generated using NLP tools.
arXiv Detail & Related papers (2022-12-24T11:22:34Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from rich-sourced languages to poorer ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution [28.878540389202367]
We develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial.
We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets.
Surprisingly, we find that reasonable quality annotations were already achievable (>90% agreement between the crowd and expert annotations) even without extensive training.
arXiv Detail & Related papers (2022-10-13T17:09:59Z)
- Entity Disambiguation with Entity Definitions [50.01142092276296]
Local models have recently attained astounding performances in Entity Disambiguation (ED).
Previous works limited their studies to using, as the textual representation of each candidate, only its Wikipedia title.
In this paper, we address this limitation and investigate to what extent more expressive textual representations can mitigate it.
We report a new state of the art on 2 out of 6 benchmarks we consider and strongly improve the generalization capability over unseen patterns.
arXiv Detail & Related papers (2022-10-11T17:46:28Z)
- Monolingual alignment of word senses and definitions in lexicographical resources [0.0]
The focus of this thesis is broadly on the alignment of lexicographical data, particularly dictionaries.
The first task aims to find an optimal alignment given the sense definitions of a headword in two different monolingual dictionaries.
This benchmark can be used for evaluation purposes of word-sense alignment systems.
arXiv Detail & Related papers (2022-09-06T13:09:52Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Knowledge-Rich Self-Supervised Entity Linking [58.838404666183656]
Knowledge-RIch Self-Supervision (KRISSBERT) is a universal entity linker for four million UMLS entities.
Our approach subsumes zero-shot and few-shot methods, and can easily incorporate entity descriptions and gold mention labels if available.
Without using any labeled information, our method produces KRISSBERT, a universal entity linker for four million UMLS entities.
arXiv Detail & Related papers (2021-12-15T05:05:12Z)
- Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks [17.033055327465238]
We propose two contrasting paradigms for data annotation.
The descriptive paradigm encourages annotator subjectivity, whereas the prescriptive paradigm discourages it.
We argue that dataset creators should explicitly aim for one or the other to facilitate the intended use of their dataset.
arXiv Detail & Related papers (2021-12-14T15:38:22Z)
- DaN+: Danish Nested Named Entities and Lexical Normalization [18.755176247223616]
This paper introduces DaN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization.
We empirically assess three strategies to model the two-layer Named Entity Recognition (NER) task.
Our results show that 1) the most robust strategy is multi-task learning which is rivaled by multi-label decoding, 2) BERT-based NER models are sensitive to domain shifts, and 3) in-language BERT and lexical normalization are the most beneficial on the least canonical data.
arXiv Detail & Related papers (2021-05-24T14:35:21Z)