Truth Discovery in Sequence Labels from Crowds
- URL: http://arxiv.org/abs/2109.04470v2
- Date: Sat, 1 Jul 2023 23:38:34 GMT
- Title: Truth Discovery in Sequence Labels from Crowds
- Authors: Nasim Sabetpour, Adithya Kulkarni, Sihong Xie, Qi Li
- Abstract summary: Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), have been deployed to collect annotations for sequence labeling, because hiring domain experts is costly.
Existing literature in annotation aggregation assumes that annotations are independent and thus faces challenges when handling sequential label aggregation tasks.
We propose an optimization-based method that infers the ground truth labels using annotations provided by workers for sequential labeling tasks.
- Score: 12.181422057560201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Annotation quality and quantity positively affect the learning performance of
sequence labeling, a vital task in Natural Language Processing. Hiring domain
experts to annotate a corpus is very costly in terms of money and time.
Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), have been
deployed for this purpose. However, the annotations collected this way
are prone to human errors due to the lack of expertise of the crowd workers.
Existing literature in annotation aggregation assumes that annotations are
independent and thus faces challenges when handling the sequential label
aggregation tasks with complex dependencies. To address these challenges, we
propose an optimization-based method that infers the ground truth labels using
annotations provided by workers for sequential labeling tasks. The proposed
Aggregation method for Sequential Labels from Crowds ($AggSLC$) jointly
considers the characteristics of sequential labeling tasks, workers'
reliabilities, and advanced machine learning techniques. Theoretical analysis
on the algorithm's convergence further demonstrates that the proposed $AggSLC$
halts after a finite number of iterations. We evaluate $AggSLC$ on different
crowdsourced datasets for Named Entity Recognition (NER) tasks and Information
Extraction tasks in the biomedical domain (PICO), as well as a simulated dataset. Our
results show that the proposed method outperforms the state-of-the-art
aggregation methods. To gain insight into the framework, we study the
effectiveness of $AggSLC$'s components through ablation studies.
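The abstract describes inferring ground-truth labels jointly with worker reliabilities. As a rough illustration of that general idea only, the sketch below implements a generic iterative truth-discovery loop (weighted voting alternating with a reliability update). It is not $AggSLC$: it treats every token independently, which is exactly the assumption the abstract criticizes, and the function name `aggregate_labels`, the smoothing constant, and the toy annotations are all hypothetical.

```python
import math
from collections import defaultdict


def aggregate_labels(annotations, n_iters=20, eps=1e-3):
    """Generic truth discovery over token labels (illustrative, not AggSLC).

    annotations: dict mapping worker id -> list of string labels, one per
    token, with None where the worker gave no label.
    Returns (estimated labels, worker weights).
    """
    workers = list(annotations)
    n_tokens = len(next(iter(annotations.values())))
    weights = {w: 1.0 for w in workers}  # start with equal reliability
    truths = [None] * n_tokens

    for _ in range(n_iters):
        # Step 1: estimate each token's label by reliability-weighted voting.
        new_truths = []
        for t in range(n_tokens):
            votes = defaultdict(float)
            for w in workers:
                label = annotations[w][t]
                if label is not None:
                    votes[label] += weights[w]
            # sorted() makes tie-breaking deterministic (alphabetical).
            best = max(sorted(votes), key=votes.get) if votes else None
            new_truths.append(best)

        # Step 2: re-estimate reliabilities from disagreement with the
        # current estimates; fewer errors yield a larger weight.
        errors = {
            w: sum(
                1
                for t in range(n_tokens)
                if annotations[w][t] is not None
                and annotations[w][t] != new_truths[t]
            )
            for w in workers
        }
        total = sum(errors.values()) + eps * len(workers)
        weights = {w: -math.log((errors[w] + eps) / total) for w in workers}

        if new_truths == truths:  # estimates stopped changing
            break
        truths = new_truths

    return truths, weights


# Toy example: two workers agree, a third disagrees on three tokens.
annotations = {
    "w1": ["B-PER", "O", "O", "B-LOC"],
    "w2": ["B-PER", "O", "O", "B-LOC"],
    "w3": ["O", "O", "B-LOC", "O"],
}
truths, weights = aggregate_labels(annotations)
print(truths)
print(weights)
```

Because each token is voted on in isolation, this baseline can produce label sequences that violate sequential constraints (for example, an inside tag without a preceding begin tag); handling such dependencies is the gap the abstract says the proposed method targets.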
Related papers
- Revisiting Sparse Retrieval for Few-shot Entity Linking [33.15662306409253]
We propose an ELECTRA-based keyword extractor to denoise the mention context and construct a better query expression.
For training the extractor, we propose a distant supervision method to automatically generate training data based on overlapping tokens between mention contexts and entity descriptions.
Experimental results on the ZESHEL dataset demonstrate that the proposed method outperforms state-of-the-art models by a significant margin across all test domains.
arXiv Detail & Related papers (2023-10-19T03:51:10Z)
- Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels [6.872072177648135]
We propose a novel localization algorithm that adapts well-established ground truth estimation methods.
Our algorithm also shows superior performance during training on the TexBiG dataset.
arXiv Detail & Related papers (2023-09-18T13:08:44Z)
- IDAS: Intent Discovery with Abstractive Summarization [16.731183915325584]
We show that recent competitive methods in intent discovery can be outperformed by clustering utterances based on abstractive summaries.
We contribute the IDAS approach, which collects a set of descriptive utterance labels by prompting a Large Language Model.
The utterances and their resulting noisy labels are then encoded by a frozen pre-trained encoder, and subsequently clustered to recover the latent intents.
arXiv Detail & Related papers (2023-05-31T12:19:40Z)
- USB: A Unified Summarization Benchmark Across Tasks and Domains [68.82726887802856]
We introduce a Wikipedia-derived benchmark, complemented by a rich set of crowd-sourced annotations, that supports $8$ interrelated tasks.
We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models.
arXiv Detail & Related papers (2023-05-23T17:39:54Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- Multi-View Knowledge Distillation from Crowd Annotations for Out-of-Domain Generalization [53.24606510691877]
We propose new methods for acquiring soft-labels from crowd-annotations by aggregating the distributions produced by existing methods.
We demonstrate that these aggregation methods lead to the most consistent performance across four NLP tasks on out-of-domain test sets.
arXiv Detail & Related papers (2022-12-19T12:40:18Z)
- Debiased Pseudo Labeling in Self-Training [77.83549261035277]
Deep neural networks achieve remarkable performances on a wide range of tasks with the aid of large-scale labeled datasets.
To mitigate the requirement for labeled data, self-training is widely used in both academia and industry by pseudo labeling on readily-available unlabeled data.
We propose Debiased, in which the generation and utilization of pseudo labels are decoupled by two independent heads.
arXiv Detail & Related papers (2022-02-15T02:14:33Z)
- Focusing on Potential Named Entities During Active Label Acquisition [0.0]
Named entity recognition (NER) aims to identify mentions of named entities in an unstructured text.
Many domain-specific NER applications still call for a substantial amount of labeled data.
We propose a better data-driven normalization approach to penalize sentences that are too long or too short.
arXiv Detail & Related papers (2021-11-06T09:04:16Z)
- Learning with Noisy Labels by Targeted Relabeling [52.0329205268734]
Crowdsourcing platforms are often used to collect datasets for training deep neural networks.
We propose an approach which reserves a fraction of annotations to explicitly relabel highly probable labeling errors.
arXiv Detail & Related papers (2021-10-15T20:37:29Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.