Improving Named Entity Recognition in Telephone Conversations via
Effective Active Learning with Human in the Loop
- URL: http://arxiv.org/abs/2211.01354v1
- Date: Wed, 2 Nov 2022 17:55:04 GMT
- Title: Improving Named Entity Recognition in Telephone Conversations via
Effective Active Learning with Human in the Loop
- Authors: Md Tahmid Rahman Laskar, Cheng Chen, Xue-Yong Fu, Shashi Bhushan TN
- Abstract summary: We present an active learning framework that leverages human-in-the-loop learning to identify samples in the annotated dataset that are likely to contain annotation errors and select them for re-annotation.
By re-annotating only about 6% of the training instances in the whole dataset, the F1 score for a certain entity type can be significantly improved by about 25%.
- Score: 2.1004132913758267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Telephone transcription data can be very noisy due to speech recognition
errors, disfluencies, etc. Not only is annotating such data very challenging
for annotators, but the resulting annotations may also contain many errors
even after the annotation job is completed, leading to very poor model
performance. In this paper, we present an active learning framework that
leverages human-in-the-loop learning to identify the samples in the annotated
dataset that are more likely to contain annotation errors and select them for
re-annotation. In this way, we largely reduce the need to re-annotate the
whole dataset. We conduct extensive experiments with our proposed approach for
Named Entity Recognition and observe that by re-annotating only about 6% of
the training instances in the whole dataset, the F1 score for a certain entity
type can be significantly improved by about 25%.
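The abstract does not spell out how candidate samples are scored, so the following is only a minimal sketch of one plausible selection criterion: flag the samples where the trained model's token-level confidence most contradicts the gold annotation. The `predict_proba` interface, the scoring rule, and the 6% budget default are illustrative assumptions, not the paper's published procedure.

```python
# Sketch only: rank annotated NER samples for human re-annotation by how
# strongly a trained model disagrees with their gold labels (assumption:
# confident disagreement signals a likely annotation error).
from typing import Callable, Dict, List

def rank_for_reannotation(
    samples: List[Dict],  # each: {"tokens": [...], "gold": [...]}
    predict_proba: Callable[[List[str]], List[Dict[str, float]]],  # hypothetical model API
    budget: float = 0.06,  # re-annotate ~6% of the data, as in the abstract
) -> List[int]:
    """Return indices of the samples most likely to contain annotation errors."""
    scores = []
    for idx, sample in enumerate(samples):
        probs = predict_proba(sample["tokens"])  # one label->prob dict per token
        # Average probability mass the model places on labels other than gold.
        per_token = [
            1.0 - token_probs.get(gold, 0.0)
            for token_probs, gold in zip(probs, sample["gold"])
        ]
        scores.append((sum(per_token) / max(len(per_token), 1), idx))
    scores.sort(reverse=True)  # most suspicious first
    k = max(1, int(budget * len(samples)))
    return [idx for _, idx in scores[:k]]

# Toy usage: a stub model that is confident "Acme" is an organization.
stub = lambda toks: [{"B-ORG": 0.9, "O": 0.1} if t == "Acme" else {"O": 1.0} for t in toks]
data = [
    {"tokens": ["call", "Acme", "now"], "gold": ["O", "O", "O"]},  # likely mislabeled
    {"tokens": ["hello", "there"], "gold": ["O", "O"]},
]
print(rank_for_reannotation(data, stub, budget=0.5))  # -> [0]
```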
Related papers
- dopanim: A Dataset of Doppelganger Animals with Noisy Annotations from Multiple Humans [1.99197168821625]
We introduce a novel benchmark dataset, dopanim, consisting of about 15,750 animal images of 15 classes with ground truth labels.
For approximately 10,500 of these images, 20 humans provided over 52,000 annotations with an accuracy of circa 67%.
We benchmark well-known multi-annotator learning approaches using seven variants of this dataset and outline further evaluation use cases such as learning beyond hard class labels and active learning.
arXiv Detail & Related papers (2024-07-30T16:27:51Z)
- Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems.
We focus on achieving language adaptation using minimal labeled and unlabeled data.
Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z)
- Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models [25.893228797735908]
This study focuses on the credibility of real-world datasets, including the popular benchmarks Jigsaw Civil Comments, Anthropic Harmless & Red Team, PKU BeaverTails & SafeRLHF.
Given the cost and difficulty of cleaning these datasets by humans, we introduce a systematic framework for evaluating the credibility of datasets.
We find and fix an average of 6.16% label errors in 11 datasets constructed from the above benchmarks.
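The summary does not describe the framework's internals; as a point of reference only, a generic cross-validated label-noise screen (not the paper's method) can be sketched as flagging the examples whose given label receives the least out-of-fold support.

```python
# Generic label-noise screen (illustrative baseline, not the paper's
# framework): flag examples whose given label gets little out-of-fold support.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
y_noisy = y.copy()
flip = np.random.RandomState(0).choice(len(y), size=30, replace=False)
y_noisy[flip] = 1 - y_noisy[flip]  # inject 6% synthetic label errors

# Out-of-fold probabilities keep the model from memorizing its own noise.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=5, method="predict_proba")
support = proba[np.arange(len(y_noisy)), y_noisy]  # confidence in given label
suspects = np.argsort(support)[:30]  # the 30 least-supported labels
print(f"{np.isin(suspects, flip).mean():.0%} of flagged items are true errors")
```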
arXiv Detail & Related papers (2023-11-19T02:34:12Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
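As a rough illustration of that selection idea only: the real XAL framework scores model-generated explanations, whereas the `explanation_score` callable below is a hypothetical stand-in, and the entropy-plus-explanation weighting is an assumption.

```python
# Sketch of XAL-style acquisition: prefer unlabeled examples that are both
# uncertain and poorly explained (explanation_score is hypothetical).
import math
from typing import Callable, List

def xal_select(
    pool: List[str],
    predict_proba: Callable[[str], List[float]],
    explanation_score: Callable[[str], float],  # assumed to return [0, 1]
    k: int = 1,
    alpha: float = 0.5,  # weight between uncertainty and explanation terms
) -> List[str]:
    def entropy(p: List[float]) -> float:
        return -sum(q * math.log(q + 1e-12) for q in p)
    ranked = sorted(
        pool,
        key=lambda x: alpha * entropy(predict_proba(x))
                      + (1 - alpha) * (1.0 - explanation_score(x)),
        reverse=True,
    )
    return ranked[:k]

# Toy demo with stubbed model outputs and explanation scores.
probs = {"refund the fee": [0.5, 0.5], "great, thanks": [0.95, 0.05]}
expl = {"refund the fee": 0.2, "great, thanks": 0.9}
print(xal_select(list(probs), probs.get, expl.get))  # -> ['refund the fee']
```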
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- Combining Public Human Activity Recognition Datasets to Mitigate Labeled Data Scarcity [1.274578243851308]
We propose a novel strategy to combine publicly available datasets with the goal of learning a generalized HAR model.
Our experimental evaluation, which includes experimenting with different state-of-the-art neural network architectures, shows that combining public datasets can significantly reduce the number of labeled samples required.
arXiv Detail & Related papers (2023-06-23T18:51:22Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- Assisted Text Annotation Using Active Learning to Achieve High Quality with Little Effort [9.379650501033465]
We propose a tool that enables researchers to create large, high-quality, annotated datasets with only a few manual annotations.
We combine an active learning (AL) approach with a pre-trained language model to semi-automatically identify annotation categories.
Our preliminary results show that employing AL strongly reduces the number of annotations required for correct classification of even complex and subtle frames.
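A minimal uncertainty-sampling loop in that spirit might look as follows; note that a TF-IDF classifier stands in for the pre-trained language model and the human annotator's response is stubbed out.

```python
# Sketch of an active-learning annotation loop with uncertainty sampling.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product", "terrible service", "loved it", "awful", "fine", "bad buy"]
labels = {0: 1, 1: 0}  # seed annotations: text index -> class

X = TfidfVectorizer().fit_transform(texts)
for _ in range(3):  # three human-in-the-loop rounds
    seen = sorted(labels)
    clf = LogisticRegression().fit(X[seen], [labels[i] for i in seen])
    pool = [i for i in range(len(texts)) if i not in labels]
    if not pool:
        break
    proba = clf.predict_proba(X[pool])
    # Query the pooled example the model is least sure about.
    query = pool[int(np.argmin(proba.max(axis=1)))]
    print("please annotate:", texts[query])
    labels[query] = 1  # placeholder for the human-provided label
```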
arXiv Detail & Related papers (2021-12-15T13:14:58Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- A Novel Perspective for Positive-Unlabeled Learning via Noisy Labels [49.990938653249415]
This research presents a methodology that assigns initial pseudo-labels to unlabeled data, treats them as noisy labels, and trains a deep neural network on the resulting noisy-labeled data.
Experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art methods on several benchmark datasets.
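A minimal sketch of that pseudo-label-as-noisy-label recipe, with a linear model standing in for the deep network and an assumed 0.5 threshold as the initial labeling rule:

```python
# Sketch of the pseudo-label-as-noisy-label recipe (linear model as a
# stand-in for the deep network; the paper's labeling rule may differ).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, random_state=1)
pos = np.where(y == 1)[0][:40]  # only 40 known positives; the rest unlabeled
unlabeled = np.setdiff1d(np.arange(len(y)), pos)

# Step 1: treat everything unlabeled as negative to get a rough initial scorer.
y_init = np.zeros(len(y), dtype=int)
y_init[pos] = 1
scorer = LogisticRegression(max_iter=1000).fit(X, y_init)

# Step 2: assign pseudo-labels to the unlabeled set; noisy by construction.
pseudo = y_init.copy()
pseudo[unlabeled] = (scorer.predict_proba(X[unlabeled])[:, 1] > 0.5).astype(int)

# Step 3: train the final model on positives plus the noisy pseudo-labels.
final = LogisticRegression(max_iter=1000).fit(X, pseudo)
print(f"agreement with true labels: {final.score(X, y):.2f}")
```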
arXiv Detail & Related papers (2021-03-08T11:46:02Z)
- Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation effort by learning to count in the crowd from a limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
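A rough sketch of the pseudo-ground-truth step under simplifying assumptions (1-D features in place of image representations; the median-uncertainty filter below is an assumption, not the paper's rule):

```python
# Sketch of GP-based pseudo-ground-truth estimation for unlabeled samples.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X_lab = rng.uniform(0, 10, (15, 1))  # few labeled samples
y_lab = 3 * X_lab[:, 0] + rng.normal(0, 1, 15)  # their crowd counts
X_unlab = rng.uniform(0, 10, (100, 1))  # many unlabeled samples

gp = GaussianProcessRegressor(kernel=RBF(), alpha=1.0).fit(X_lab, y_lab)
pseudo_y, std = gp.predict(X_unlab, return_std=True)
# Low predictive std = more trustworthy pseudo-ground truth for the next
# training round; high-std samples can be down-weighted or skipped.
trusted = X_unlab[std < np.median(std)]
print(f"kept {len(trusted)} of {len(X_unlab)} pseudo-labeled samples")
```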
arXiv Detail & Related papers (2020-07-07T04:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.