Validating Label Consistency in NER Data Annotation
- URL: http://arxiv.org/abs/2101.08698v1
- Date: Thu, 21 Jan 2021 16:19:00 GMT
- Title: Validating Label Consistency in NER Data Annotation
- Authors: Qingkai Zeng, Mengxia Yu, Wenhao Yu, Tianwen Jiang, Tim Weninger and
Meng Jiang
- Abstract summary: In this work, we present an empirical method to explore the relationship between label (in-)consistency and NER model performance.
In experiments, our method identified label inconsistency in the test data of the SCIERC and CoNLL03 datasets.
- Score: 34.24378200299595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data annotation plays a crucial role in ensuring that named entity
recognition (NER) models are trained with the right information to learn from.
Producing the most accurate labels is a challenge due to the complexity
involved in annotation. Label inconsistency between multiple subsets of data
annotation (e.g., the training set and test set, or multiple training subsets)
is an indicator of label mistakes. In this work, we present an empirical method
to explore the relationship between label (in-)consistency and NER model
performance. It can be used to validate label consistency (or catch
inconsistency) across multiple sets of NER data annotation. In experiments, our
method identified label inconsistency in the test data of the SCIERC and
CoNLL03 datasets (26.7% and 5.4% label mistakes, respectively). It validated
the consistency of the corrected versions of both datasets.
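As a rough illustration of the cross-subset idea, the sketch below trains on part of one annotated subset and compares in-subset accuracy against accuracy on a second subset; a large gap flags inconsistent labels. This is a minimal sketch, not the authors' method or code: the majority-vote tagger is a stand-in for a real NER model, and `consistency_gap`, `train_majority_tagger`, and the `(tokens, labels)` data layout are illustrative assumptions.

```python
# Minimal sketch of a cross-subset label-consistency check.
# A trivial majority-vote tagger stands in for a real NER model;
# sentences are assumed to be (tokens, labels) pairs with BIO tags.
from collections import Counter, defaultdict

def train_majority_tagger(sentences):
    """Memorize the most frequent label seen for each token string."""
    counts = defaultdict(Counter)
    for tokens, labels in sentences:
        for tok, lab in zip(tokens, labels):
            counts[tok][lab] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def token_accuracy(tagger, sentences, default="O"):
    """Fraction of tokens whose annotated label matches the tagger."""
    hit = total = 0
    for tokens, labels in sentences:
        for tok, lab in zip(tokens, labels):
            hit += tagger.get(tok, default) == lab
            total += 1
    return hit / max(total, 1)

def consistency_gap(subset_a, subset_b, holdout_frac=0.2):
    """Fit on most of subset_a, then compare in-subset accuracy (held-out
    part of subset_a) against cross-subset accuracy (subset_b). A large
    positive gap suggests the two subsets were labeled inconsistently."""
    k = int(len(subset_a) * (1 - holdout_frac))
    tagger = train_majority_tagger(subset_a[:k])
    return token_accuracy(tagger, subset_a[k:]) - token_accuracy(tagger, subset_b)
```

For instance, `consistency_gap(train_set, test_set)` near zero is compatible with consistently applied annotation guidelines, while a large positive gap suggests the second subset's labels deserve a manual audit, in the spirit of the 26.7% and 5.4% mistake rates reported above.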
Related papers
- Exploiting Conjugate Label Information for Multi-Instance Partial-Label Learning [61.00359941983515]
Multi-instance partial-label learning (MIPL) addresses scenarios where each training sample is represented as a multi-instance bag associated with a candidate label set containing one true label and several false positives.
ELIMIPL exploits the conjugate label information to improve the disambiguation performance.
arXiv Detail & Related papers (2024-08-26T15:49:31Z)
- You can't handle the (dirty) truth: Data-centric insights improve pseudo-labeling [60.27812493442062]
We show the importance of investigating labeled data quality to improve any pseudo-labeling method.
Specifically, we introduce a novel data characterization and selection framework called DIPS to extend pseudo-labeling.
We demonstrate the applicability and impact of DIPS for various pseudo-labeling methods across an extensive range of real-world datasets.
arXiv Detail & Related papers (2024-06-19T17:58:40Z)
- Don't Waste a Single Annotation: Improving Single-Label Classifiers Through Soft Labels [7.396461226948109]
We address the limitations of common data annotation and training methods for objective single-label classification tasks.
Our findings indicate that additional annotator information, such as confidence, secondary labels, and disagreement, can be used to effectively generate soft labels.
arXiv Detail & Related papers (2023-11-09T10:47:39Z)
- FlatMatch: Bridging Labeled Data and Unlabeled Data with Cross-Sharpness for Semi-Supervised Learning [73.13448439554497]
Semi-Supervised Learning (SSL) has been an effective way to leverage abundant unlabeled data with extremely scarce labeled data.
Most SSL methods are commonly based on instance-wise consistency between different data transformations.
We propose FlatMatch, which minimizes a cross-sharpness measure to ensure consistent learning performance between the two datasets.
arXiv Detail & Related papers (2023-10-25T06:57:59Z)
- Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels [6.872072177648135]
We propose a novel localization algorithm that adapts well-established ground-truth estimation methods.
Our algorithm also shows superior performance during training on the TexBiG dataset.
arXiv Detail & Related papers (2023-09-18T13:08:44Z)
- Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in a vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
- Identifying Incorrect Annotations in Multi-Label Classification Data [14.94741409713251]
We consider algorithms for finding mislabeled examples in multi-label classification datasets.
We propose an extension of the Confident Learning framework to this setting, as well as a label quality score that ranks examples with label errors much higher than those that are correctly labeled.
arXiv Detail & Related papers (2022-11-25T05:03:56Z)
- Detecting Label Errors in Token Classification Data [22.539748563923123]
We consider the task of finding sentences that contain label errors in token classification datasets.
We study 11 different straightforward methods that score tokens/sentences based on the predicted class probabilities.
We identify a simple and effective method that consistently detects sentences containing label errors when applied with different token classification models (a minimal scoring sketch follows this list).
arXiv Detail & Related papers (2022-10-08T05:14:22Z)
- Acknowledging the Unknown for Multi-label Learning with Single Positive Labels [65.5889334964149]
Traditionally, all unannotated labels are assumed to be negative in single positive multi-label learning (SPML).
We propose an entropy-maximization (EM) loss to maximize the entropy of predicted probabilities for all unannotated labels.
Considering the positive-negative imbalance of unannotated labels, we propose asymmetric pseudo-labeling (APL) with asymmetric-tolerance strategies and a self-paced procedure to provide more precise supervision.
arXiv Detail & Related papers (2022-03-30T11:43:59Z)
- Enhancing Label Correlation Feedback in Multi-Label Text Classification via Multi-Task Learning [6.1538971100140145]
We introduce a novel approach with multi-task learning to enhance label correlation feedback.
We propose two auxiliary label co-occurrence prediction tasks to enhance label correlation learning.
arXiv Detail & Related papers (2021-06-06T12:26:14Z)
- Federated Semi-Supervised Learning with Inter-Client Consistency & Disjoint Learning [78.88007892742438]
We study two essential scenarios of Federated Semi-Supervised Learning (FSSL) based on the location of the labeled data.
We propose a novel method to tackle the problems, which we refer to as Federated Matching (FedMatch).
arXiv Detail & Related papers (2020-06-22T09:43:41Z)
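To make the probability-based scoring mentioned under "Detecting Label Errors in Token Classification Data" concrete, here is a minimal sketch of one common variant: score each sentence by the model's probability for its least-confident annotated token, and rank low scores first. This is an assumed illustration, not necessarily the specific method that paper identifies as best; the function names and the `(prob_matrix, labels)` layout are hypothetical.

```python
# Minimal sketch of sentence-level label-error scoring from predicted
# class probabilities (one common variant; an assumed illustration).
import numpy as np

def sentence_quality(prob_matrix, labels):
    """prob_matrix[i, k]: model probability of class k for token i;
    labels[i]: index of the annotated class for token i. The sentence
    score is the probability of the least-confident annotated token
    (lower means more likely to contain a label error)."""
    return prob_matrix[np.arange(len(labels)), labels].min()

def rank_suspect_sentences(batch):
    """batch: list of (prob_matrix, labels) pairs, one per sentence.
    Returns sentence indices, most likely mislabeled first."""
    scores = [sentence_quality(p, np.asarray(y)) for p, y in batch]
    return sorted(range(len(scores)), key=scores.__getitem__)
```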
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.