DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition
- URL: http://arxiv.org/abs/2504.04616v1
- Date: Sun, 06 Apr 2025 20:54:42 GMT
- Title: DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition
- Authors: Qi Zhang, Huitong Pan, Zhijia Chen, Longin Jan Latecki, Cornelia Caragea, Eduard Dragut
- Abstract summary: We propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Our method outperforms numerous advanced DS-NER approaches across four datasets.
- Score: 49.54155332262579
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distantly Supervised Named Entity Recognition (DS-NER) has attracted attention due to its scalability and ability to automatically generate labeled data. However, distant annotation introduces many mislabeled instances, limiting its performance. Most existing work attempts to solve this problem by developing intricate models that learn from the noisy labels. An alternative is to clean the labeled data, thus increasing the quality of the distant labels; this approach has received little attention for NER. In this paper, we propose a training dynamics-based label cleaning approach, which leverages the behavior of a model as training progresses to characterize the distantly annotated samples. We also introduce an automatic threshold estimation strategy to locate the errors in distant labels. Extensive experimental results demonstrate that: (1) models trained on our cleaned DS-NER datasets, which were refined by directly removing identified erroneous annotations, achieve significant improvements in F1-score, ranging from 3.18% to 8.95%; and (2) our method outperforms numerous advanced DS-NER approaches across four datasets.
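The abstract describes the mechanism only at a high level, so a small illustration may help. The Python sketch below follows a common dataset-cartography-style reading of training dynamics: record the model's probability for each sample's distant label at every epoch, summarize each sample by its mean confidence and variability, and drop samples whose confidence falls below a cutoff. Everything here is an illustrative assumption rather than the authors' implementation; in particular, a fixed percentile stands in for the paper's automatic threshold estimation.

```python
import numpy as np

def flag_noisy_labels(prob_history: np.ndarray, cutoff_percentile: float = 20.0):
    """Flag likely annotation errors from training dynamics.

    prob_history: (n_epochs, n_samples) array; entry [e, i] is the model's
        probability for sample i's *distant* label after epoch e.
    cutoff_percentile: hypothetical stand-in for the paper's automatic
        threshold estimation strategy.
    """
    confidence = prob_history.mean(axis=0)   # high -> model agrees with the distant label
    variability = prob_history.std(axis=0)   # high -> model is ambivalent across epochs

    # Distant labels the model never becomes confident about are treated
    # as probable annotation errors.
    threshold = np.percentile(confidence, cutoff_percentile)
    return confidence < threshold, confidence, variability
```

Per the abstract, the flagged annotations are then simply removed and a standard NER model is retrained on the cleaned data, which is what yields the reported 3.18% to 8.95% F1 gains.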
Related papers
- Efficient Adaptive Label Refinement for Label Noise Learning [14.617885790129336]
We propose Adaptive Label Refinement (ALR) to avoid incorrect labels and thoroughly learn from clean samples.
ALR is simple and efficient, requiring no prior knowledge of noise or auxiliary datasets.
We validate ALR's effectiveness through experiments on benchmark datasets with artificial label noise (CIFAR-10/100) and real-world datasets with inherent noise (ANIMAL-10N, Clothing1M, WebVision).
arXiv Detail & Related papers (2025-02-01T09:58:08Z)
- Mitigating Instance-Dependent Label Noise: Integrating Self-Supervised Pretraining with Pseudo-Label Refinement
Real-world datasets often contain noisy labels due to human error, ambiguity, or resource constraints during the annotation process.
We propose a novel framework that combines self-supervised learning using SimCLR with iterative pseudo-label refinement.
Our approach significantly outperforms several state-of-the-art methods, particularly under high noise conditions. (A hedged sketch of such a refinement loop follows this entry.)
arXiv Detail & Related papers (2024-12-06T09:56:49Z)
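The summary names the two ingredients but not how they interact. One common arrangement, sketched below, is to extract features with a frozen self-supervised encoder and then alternately fit a classifier and overwrite labels it confidently disagrees with. The logistic-regression probe, the 0.9 cutoff, and the round count are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def refine_pseudo_labels(features, noisy_labels, n_rounds=5, conf_cutoff=0.9):
    """Iterative pseudo-label refinement on frozen (e.g., SimCLR-style) features.

    features: (n, d) embeddings from a pretrained encoder.
    noisy_labels: (n,) initial noisy labels, refined round by round.
    """
    labels = noisy_labels.copy()
    for _ in range(n_rounds):
        clf = LogisticRegression(max_iter=1000).fit(features, labels)
        pred = clf.predict(features)
        conf = clf.predict_proba(features).max(axis=1)
        # Overwrite only labels the classifier confidently disagrees with.
        flip = (conf >= conf_cutoff) & (pred != labels)
        if not flip.any():   # converged: no label changed this round
            break
        labels[flip] = pred[flip]
    return labels
```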
- Re-Examine Distantly Supervised NER: A New Benchmark and a Simple Approach
We introduce a real-life DS-NER dataset, QTL, where the training data is annotated using domain dictionaries and the test data is annotated by domain experts.
Existing DS-NER approaches fail when applied to QTL, which motivates us to re-examine existing DS-NER approaches.
We propose a new approach, token-level Curriculum-based Positive-Unlabeled Learning (CuPUL), which uses curriculum learning to order training samples from easy to hard. (A hedged sketch of this easy-to-hard ordering follows this entry.)
arXiv Detail & Related papers (2024-02-22T20:07:02Z)
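Only the easy-to-hard ordering principle is described above, so the sketch below keeps to that: score each sample by how hard a preliminary model finds it (here its loss, an assumed difficulty measure) and release training batches from easiest to hardest. The positive-unlabeled component of CuPUL is not shown.

```python
import numpy as np

def curriculum_stages(losses: np.ndarray, n_stages: int = 4):
    """Yield index arrays from easy to hard for staged training.

    losses: per-sample loss from a preliminary model, used as a
        stand-in difficulty score (an assumption, not CuPUL's definition).
    """
    order = np.argsort(losses)                  # low loss = easy, comes first
    for stage in np.array_split(order, n_stages):
        yield stage

# Hypothetical usage: train on progressively harder slices.
# for idx in curriculum_stages(token_losses):
#     fit_one_stage(train_data, idx)
```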
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model, we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set. (A hedged sketch of discriminator-based reweighting follows this entry.)
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
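The discriminator-based recalibration is easy to make concrete. In the sketch below, a discriminator trained with help from the small clean set emits, per token, a probability that its label is clean, and that probability directly weights the token's NER loss. The raw-probability weighting and all names are assumptions; the paper's recalibration scheme may differ.

```python
import torch

def reweighted_ner_loss(token_logits, labels, clean_prob, ignore_index=-100):
    """Weight each token's NER loss by a discriminator's P(label is clean).

    token_logits: (n_tokens, n_tags) scores from the main NER model.
    labels: (n_tokens,) distant or crowdsourced tags.
    clean_prob: (n_tokens,) discriminator outputs in [0, 1].
    """
    per_token = torch.nn.functional.cross_entropy(
        token_logits, labels, ignore_index=ignore_index, reduction="none"
    )
    mask = (labels != ignore_index).float()
    weighted = per_token * clean_prob * mask    # suspicious labels barely count
    return weighted.sum() / mask.sum().clamp(min=1.0)
```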
- Learning to Detect Noisy Labels Using Model-Based Features [16.681748918518075]
We propose Selection-Enhanced Noisy label Training (SENT).
SENT does not rely on meta learning while having the flexibility of being data-driven.
It improves performance over strong baselines under the settings of self-training and label corruption.
arXiv Detail & Related papers (2022-12-28T10:12:13Z)
- Unsupervised Domain Adaptive Salient Object Detection Through Uncertainty-Aware Pseudo-Label Learning [104.00026716576546]
We propose to learn saliency from synthetic but clean labels, which naturally have higher pixel-labeling quality without the effort of manual annotations.
We show that our proposed method outperforms the existing state-of-the-art deep unsupervised SOD methods on several benchmark datasets.
arXiv Detail & Related papers (2022-02-26T16:03:55Z)
- Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training [66.80558875393565]
We study the problem of training named entity recognition (NER) models using only distantly-labeled data.
We propose a noise-robust learning scheme comprising a new loss function and a noisy label removal step. (A hedged sketch of one common noise-robust loss follows this entry.)
Our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
arXiv Detail & Related papers (2021-09-10T17:19:56Z)
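The summary does not specify the new loss, so the sketch below substitutes generalized cross-entropy (Zhang and Sabuncu, 2018), a standard noise-robust choice that interpolates between cross-entropy (q -> 0) and mean absolute error (q = 1); treat it as a stand-in, not necessarily the paper's loss.

```python
import torch

def generalized_cross_entropy(logits, labels, q: float = 0.7):
    """GCE loss: L_q(p_y) = (1 - p_y^q) / q.

    Unlike cross-entropy, samples whose (possibly wrong) label receives low
    probability contribute a bounded loss, so mislabeled instances cannot
    dominate the gradient. q = 0.7 follows the original GCE paper.
    """
    probs = torch.softmax(logits, dim=-1)
    p_y = probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return ((1.0 - p_y.clamp_min(1e-8) ** q) / q).mean()
```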
- Boosting Semi-Supervised Face Recognition with Noise Robustness [54.342992887966616]
This paper presents an effective solution to semi-supervised face recognition that is robust to the label noise introduced by the auto-labelling.
We develop a semi-supervised face recognition solution, named Noise Robust Learning-Labelling (NRoLL), which is based on the robust training ability empowered by GN.
arXiv Detail & Related papers (2021-05-10T14:43:11Z)
- BOND: BERT-Assisted Open-Domain Named Entity Recognition with Distant Supervision [49.42215511723874]
We propose a new computational framework -- BOND -- to improve the prediction performance of NER models.
Specifically, we propose a two-stage training algorithm: In the first stage, we adapt the pre-trained language model to the NER tasks using the distant labels.
In the second stage, we drop the distant labels and propose a self-training approach to further improve the model performance. (A hedged sketch of such a self-training stage follows this entry.)
arXiv Detail & Related papers (2020-06-28T04:55:39Z)
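The two-stage recipe above is concrete enough to sketch. In stage two, a teacher (initialized from the stage-one model) pseudo-labels the corpus, a student fits those pseudo-labels, and the teacher is periodically refreshed from the student. The `fit` and `pseudo_label` callables are assumed helpers, and the refresh-every-round schedule is an illustrative choice rather than BOND's exact procedure.

```python
import copy

def self_train_stage2(stage1_model, fit, pseudo_label, corpus, n_rounds=3):
    """Generic teacher-student self-training loop for stage two.

    fit(model, data): trains the model in place on (inputs, labels).
    pseudo_label(model, corpus): returns confidently pseudo-labeled data;
        confidence filtering happens inside this assumed helper.
    """
    student = stage1_model
    teacher = copy.deepcopy(student)
    for _ in range(n_rounds):
        data = pseudo_label(teacher, corpus)   # distant labels are dropped here
        fit(student, data)                     # student fits the teacher's beliefs
        teacher = copy.deepcopy(student)       # refresh teacher for the next round
    return student
```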
- Named Entity Recognition without Labelled Data: A Weak Supervision Approach [23.05371427663683]
This paper presents a simple but powerful approach to learn NER models in the absence of labelled data through weak supervision.
The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain.
A sequence labelling model can finally be trained on the basis of this unified annotation. (A hedged sketch of labelling-function aggregation follows this entry.)
arXiv Detail & Related papers (2020-04-30T12:29:55Z)
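The pipeline of many labelling functions unified into a single annotation can be shown with two toy functions and a majority vote; the paper itself learns an aggregation model rather than voting, so everything below, including both labelling functions, is a simplified stand-in.

```python
from collections import Counter

# Toy labelling functions: each returns a tag or None (abstain).
def lf_capitalized(tok):
    return "ENT" if tok[:1].isupper() else None   # crude cue, fires on "He" too

def lf_gazetteer(tok, names=frozenset({"Paris", "IBM"})):
    return "ENT" if tok in names else None

LABELLING_FUNCTIONS = [lf_capitalized, lf_gazetteer]

def unify(tokens):
    """Majority vote over labelling-function outputs, abstentions ignored."""
    unified = []
    for tok in tokens:
        votes = Counter(tag for tag in (lf(tok) for lf in LABELLING_FUNCTIONS)
                        if tag is not None)
        unified.append(votes.most_common(1)[0][0] if votes else "O")
    return unified

# Sentence-initial "He" shows the noise a learned aggregator must handle:
print(unify("He visited Paris last May".split()))
# -> ['ENT', 'O', 'ENT', 'O', 'ENT']
```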