Few Clean Instances Help Denoising Distant Supervision
- URL: http://arxiv.org/abs/2209.06596v1
- Date: Wed, 14 Sep 2022 12:29:57 GMT
- Title: Few Clean Instances Help Denoising Distant Supervision
- Authors: Yufang Liu, Ziyin Huang, Yijun Wang, Changzhi Sun, Man Lan, Yuanbin Wu, Xiaofeng Mou and Ding Wang
- Abstract summary: We study whether a small clean dataset could help improve the quality of distantly supervised models.
We show that besides getting a more convincing evaluation of models, a small clean dataset also helps us to build more robust denoising models.
- Score: 28.336399223985175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing distantly supervised relation extractors usually rely on noisy data
for both model training and evaluation, which may lead to
garbage-in-garbage-out systems. To alleviate the problem, we study whether a
small clean dataset could help improve the quality of distantly supervised
models. We show that besides getting a more convincing evaluation of models, a
small clean dataset also helps us to build more robust denoising models.
Specifically, we propose a new criterion for clean instance selection based on
influence functions. It collects sample-level evidence for recognizing good
instances (which is more informative than loss-level evidence). We also propose
a teacher-student mechanism for controlling purity of intermediate results when
bootstrapping the clean set. The whole approach is model-agnostic and
demonstrates strong performances on both denoising real (NYT) and synthetic
noisy datasets.
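The influence-based selection criterion described in the abstract can be sketched as follows. This is a minimal first-order illustration on toy logistic-regression data, not the paper's implementation: each noisy sample is scored by the dot product between its loss gradient and the mean gradient on the small clean set (the Hessian term of the full influence function is skipped for simplicity, and all function names are ours). A positive score suggests the sample pushes the model in the same direction as the clean data, i.e. it is likely correctly labeled.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_logloss(w, x, y):
    # Gradient of the logistic loss for one example (label y in {0, 1}).
    p = sigmoid(x @ w)
    return (p - y) * x

def influence_scores(w, X_noisy, y_noisy, X_clean, y_clean):
    """First-order influence proxy: alignment between each noisy
    sample's gradient and the mean gradient on the clean set."""
    g_clean = np.mean([grad_logloss(w, x, y)
                       for x, y in zip(X_clean, y_clean)], axis=0)
    return np.array([grad_logloss(w, x, y) @ g_clean
                     for x, y in zip(X_noisy, y_noisy)])

# Toy data: linearly separable labels with 30% random flips.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = (X @ w_true > 0).astype(float)
flip = rng.random(200) < 0.3
y_noisy = np.where(flip, 1 - y, y)

# A small clean set, and a rough model fit on the noisy data.
X_clean, y_clean = X[:20], y[:20]
w = np.zeros(2)
for _ in range(100):
    w -= 0.1 * np.mean([grad_logloss(w, x, yy)
                        for x, yy in zip(X, y_noisy)], axis=0)

# Keep the top-scoring half as the selected "clean" subset.
scores = influence_scores(w, X, y_noisy, X_clean, y_clean)
keep = np.argsort(scores)[-100:]
purity = np.mean(~flip[keep])
print(f"purity of selected set: {purity:.2f} "
      f"(vs {np.mean(~flip):.2f} overall)")
```

Because flipped samples pull the model away from the clean-set gradient direction, their scores tend to be negative, so the selected subset is noticeably purer than the raw noisy data.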
Related papers
- Foster Adaptivity and Balance in Learning with Noisy Labels [26.309508654960354]
We propose a novel approach named SED to deal with label noise in a Self-adaptivE and class-balanceD manner.
A mean-teacher model is then employed to correct labels of noisy samples.
We additionally propose a self-adaptive and class-balanced sample re-weighting mechanism to assign different weights to detected noisy samples.
arXiv Detail & Related papers (2024-07-03T03:10:24Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Improving a Named Entity Recognizer Trained on Noisy Data with a Few Clean Instances [55.37242480995541]
We propose to denoise noisy NER data with guidance from a small set of clean instances.
Along with the main NER model we train a discriminator model and use its outputs to recalibrate the sample weights.
Results on public crowdsourcing and distant supervision datasets show that the proposed method can consistently improve performance with a small guidance set.
arXiv Detail & Related papers (2023-10-25T17:23:37Z)
- Fine-tuning Pre-trained Models for Robustness Under Noisy Labels [34.68018860186995]
The presence of noisy labels in a training dataset can significantly impact the performance of machine learning models.
We introduce a novel algorithm called TURN, which robustly and efficiently transfers the prior knowledge of pre-trained models.
arXiv Detail & Related papers (2023-10-24T20:28:59Z)
- Combating Label Noise With A General Surrogate Model For Sample Selection [84.61367781175984]
We propose to leverage the vision-language surrogate model CLIP to filter noisy samples automatically.
We validate the effectiveness of our proposed method on both real-world and synthetic noisy datasets.
arXiv Detail & Related papers (2023-10-16T14:43:27Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Improving Distantly Supervised Relation Extraction with Self-Ensemble Noise Filtering [17.45521023572853]
We propose a self-ensemble filtering mechanism to filter out noisy samples during the training process.
Our experiments with multiple state-of-the-art relation extraction models show that our proposed filtering mechanism improves the robustness of the models and increases their F1 scores.
arXiv Detail & Related papers (2021-08-22T11:23:36Z)
- Bridging the Gap Between Clean Data Training and Real-World Inference for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a gap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedded into a similar vector space.
Experiments on the widely-used Snips dataset and a large-scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on a real-world (noisy) corpus but also enhances robustness, producing high-quality results in a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
- Deep k-NN for Noisy Labels [55.97221021252733]
We show that a simple $k$-nearest neighbor-based filtering approach on the logit layer of a preliminary model can remove mislabeled data and produce more accurate models than many recently proposed methods.
arXiv Detail & Related papers (2020-04-26T05:15:36Z)
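The k-NN filtering idea from the last entry can be illustrated with a small sketch: flag a sample as likely mislabeled when its label disagrees with the majority label of its nearest neighbors in the model's representation space. Here two Gaussian clusters stand in for the logit layer of a preliminary model; the data and function names are an assumption-laden toy, not the paper's code.

```python
import numpy as np

def knn_filter(feats, labels, k=10):
    """Keep a sample only if its label matches the majority label of its
    k nearest neighbors in feature space. Brute-force distance matrix;
    fine for small datasets."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of k nearest neighbors
    neighbor_votes = labels[nn].mean(axis=1)  # fraction of neighbors labeled 1
    majority = (neighbor_votes > 0.5).astype(labels.dtype)
    return majority == labels              # boolean mask: True = keep

rng = np.random.default_rng(1)
# Two well-separated clusters standing in for a model's logit layer.
feats = np.vstack([rng.normal(-2, 1, (100, 2)),
                   rng.normal(2, 1, (100, 2))])
y_true = np.repeat([0, 1], 100)
flip = rng.random(200) < 0.2
y_noisy = np.where(flip, 1 - y_true, y_true)

keep = knn_filter(feats, y_noisy, k=10)
purity = np.mean(y_noisy[keep] == y_true[keep])
print(f"kept {keep.sum()} samples, purity {purity:.2f}")
```

Since a flipped sample still sits inside its true cluster, most of its neighbors carry the correct label and outvote it, so the kept subset is purer than the raw noisy labels.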
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.