Improving a Named Entity Recognizer Trained on Noisy Data with a Few
Clean Instances
- URL: http://arxiv.org/abs/2310.16790v1
- Date: Wed, 25 Oct 2023 17:23:37 GMT
- Title: Improving a Named Entity Recognizer Trained on Noisy Data with a Few
Clean Instances
- Authors: Zhendong Chu, Ruiyi Zhang, Tong Yu, Rajiv Jain, Vlad I Morariu,
Jiuxiang Gu, Ani Nenkova
- Abstract summary: We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
- Score: 55.37242480995541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To achieve state-of-the-art performance, one still needs to train NER models
on large-scale, high-quality annotated data, an asset that is both costly and
time-intensive to accumulate. In contrast, real-world applications often resort
to massive low-quality labeled data through non-expert annotators via
crowdsourcing and external knowledge bases via distant supervision as a
cost-effective alternative. However, these annotation methods result in noisy
labels, which in turn lead to a notable decline in performance. Hence, we
propose to denoise the noisy NER data with guidance from a small set of clean
instances. Along with the main NER model we train a discriminator model and use
its outputs to recalibrate the sample weights. The discriminator is capable of
detecting both span and category errors with different discriminative prompts.
Results on public crowdsourcing and distant supervision datasets show that the
proposed method can consistently improve performance with a small guidance set.
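The abstract says the discriminator's outputs are used to recalibrate sample weights. A minimal sketch of how per-token weights could enter a training loss is shown below. It is not the authors' implementation: the function name `weighted_token_loss`, the idea of passing discriminator confidences directly as weights, and the toy numbers are all assumptions made for illustration.

```python
import math

def weighted_token_loss(probs, labels, weights, eps=1e-12):
    """Weighted average negative log-likelihood over tokens.

    probs:   per-token lists of class probabilities from the NER model
    labels:  per-token (possibly noisy) label indices
    weights: per-token weights in [0, 1]; here assumed to be a
             discriminator's confidence that the label is correct
             (hypothetical interface, not taken from the paper)
    """
    total = sum(weights)
    if total == 0:
        return 0.0
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        # Clamp to avoid log(0) when the model assigns zero probability.
        loss += -w * math.log(max(p[y], eps))
    return loss / total

# Toy example: the third label disagrees with the model and is
# assumed to be flagged as likely noisy by the discriminator.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
uniform = weighted_token_loss(probs, labels, [1.0, 1.0, 1.0])
reweighted = weighted_token_loss(probs, labels, [1.0, 1.0, 0.1])
```

Down-weighting the suspect token reduces its contribution to the loss, so `reweighted` is smaller than `uniform` in this toy case.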
Related papers
- Re-Examine Distantly Supervised NER: A New Benchmark and a Simple
Approach [15.87963432758696]
We critically assess the efficacy of current DS-NER methodologies using a real-world benchmark dataset named QTL.
To tackle the prevalent issue of label noise, we introduce a simple yet effective approach, Curriculum-based Positive-Unlabeled Learning (CuPUL).
Our empirical results highlight the capability of CuPUL to significantly reduce the impact of noisy labels and outperform existing methods.
arXiv Detail & Related papers (2024-02-22T20:07:02Z)
- Learning to Detect Noisy Labels Using Model-Based Features [16.681748918518075]
We propose Selection-Enhanced Noisy label Training (SENT).
SENT does not rely on meta learning while having the flexibility of being data-driven.
It improves performance over strong baselines under the settings of self-training and label corruption.
arXiv Detail & Related papers (2022-12-28T10:12:13Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incorporate high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- Feature Diversity Learning with Sample Dropout for Unsupervised Domain
Adaptive Person Re-identification [0.0]
This paper proposes a new approach that learns feature representations with better generalization ability by limiting noisy pseudo labels.
We put forward a new method, referred to as Feature Diversity Learning (FDL), under the classic mutual-teaching architecture.
Experimental results show that our proposed FDL-SD achieves state-of-the-art performance on multiple benchmark datasets.
arXiv Detail & Related papers (2022-01-25T10:10:48Z)
- Distantly-Supervised Named Entity Recognition with Noise-Robust Learning
and Language Model Augmented Self-Training [66.80558875393565]
We study the problem of training named entity recognition (NER) models using only distantly-labeled data.
We propose a noise-robust learning scheme comprised of a new loss function and a noisy label removal step.
Our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
arXiv Detail & Related papers (2021-09-10T17:19:56Z)
- WSSOD: A New Pipeline for Weakly- and Semi-Supervised Object Detection [75.80075054706079]
We propose a weakly- and semi-supervised object detection framework (WSSOD).
An agent detector is first trained on a joint dataset and then used to predict pseudo bounding boxes on weakly-annotated images.
The proposed framework demonstrates remarkable performance on the PASCAL-VOC and MSCOCO benchmarks, comparable to that obtained in fully-supervised settings.
arXiv Detail & Related papers (2021-05-21T11:58:50Z)
- Tackling Instance-Dependent Label Noise via a Universal Probabilistic
Model [80.91927573604438]
This paper proposes a simple yet universal probabilistic model, which explicitly relates noisy labels to their instances.
Experiments on datasets with both synthetic and real-world label noise verify that the proposed method yields significant improvements on robustness.
arXiv Detail & Related papers (2021-01-14T05:43:51Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource
Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Meta-Learning for Neural Relation Classification with Distant
Supervision [38.755055486296435]
We propose a meta-learning based approach, which learns to reweight noisy training data under the guidance of reference data.
Experiments on several datasets demonstrate that the reference data can effectively guide the selection of training data.
arXiv Detail & Related papers (2020-10-26T12:52:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.