Cost-efficient Crowdsourcing for Span-based Sequence Labeling: Worker Selection and Data Augmentation
- URL: http://arxiv.org/abs/2305.06683v1
- Date: Thu, 11 May 2023 09:40:24 GMT
- Authors: Yujie Wang, Chao Huang, Liner Yang, Zhixuan Fang, Yaping Huang, Yang Liu, Erhong Yang
- Abstract summary: This study contends with the complexities of label interdependencies in sequence labeling tasks.
The proposed algorithm utilizes a Combinatorial Multi-Armed Bandit (CMAB) approach for worker selection.
- Score: 26.462370031232314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a novel worker selection algorithm, enhancing
annotation quality and reducing costs in challenging span-based sequence
labeling tasks in Natural Language Processing (NLP). Unlike previous studies
targeting simpler tasks, this study contends with the complexities of label
interdependencies in sequence labeling tasks. The proposed algorithm utilizes a
Combinatorial Multi-Armed Bandit (CMAB) approach for worker selection. The
challenge of dealing with imbalanced and small-scale datasets, which hinders
offline simulation of worker selection, is tackled using an innovative data
augmentation method termed shifting, expanding, and shrinking (SES). The SES
method is designed specifically for sequence labeling tasks. Rigorous testing
on CoNLL 2003 NER and Chinese OEI datasets showcased the algorithm's
efficiency, achieving up to 100.04% of the expert-only baseline's F1 score
alongside cost savings of up to 65.97%. The paper also encompasses a
dataset-independent test that emulates annotation evaluation through a
Bernoulli distribution, which still achieved 97.56% of the expert baseline's
F1 score and 59.88% cost savings. This research addresses and overcomes
numerous obstacles in worker selection for complex NLP tasks.
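The abstract names three boundary perturbations (shifting, expanding, shrinking) that SES applies to gold spans to synthesize noisy annotations for offline simulation. The paper does not give its exact procedure here, so the sketch below is an assumption of how such perturbations could be applied to inclusive `(start, end)` token spans; the function name and parameters are illustrative, not the authors' API.

```python
import random

def ses_augment(spans, n_tokens, op=None, rng=None):
    """Perturb gold span boundaries to synthesize noisy annotations.

    spans:    list of (start, end) inclusive token indices
    n_tokens: sentence length, used to clip perturbed spans
    op:       "shift", "expand", or "shrink" (random if None)
    """
    rng = rng or random.Random(0)
    op = op or rng.choice(["shift", "expand", "shrink"])
    new_spans = []
    for start, end in spans:
        if op == "shift":
            delta = rng.choice([-1, 1])
            start, end = start + delta, end + delta
        elif op == "expand":
            start, end = start - 1, end + 1
        elif op == "shrink" and end > start:
            # shrink only multi-token spans, from a random side
            if rng.random() < 0.5:
                start += 1
            else:
                end -= 1
        # clip to the sentence and drop spans that became invalid
        start, end = max(0, start), min(n_tokens - 1, end)
        if start <= end:
            new_spans.append((start, end))
    return new_spans
```

For example, expanding the span (2, 4) in a 10-token sentence yields (1, 5): the simulated worker over-annotates one token on each side, which is exactly the kind of label interdependency error that makes span-level quality estimation harder than per-item tasks.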
Related papers
- TAGCOS: Task-agnostic Gradient Clustered Coreset Selection for Instruction Tuning Data [29.45013725650798]
It is essential to extract a subset of instruction datasets that achieves comparable performance to the full dataset.
We propose Task-Agnostic Gradient Clustered COreset Selection (TAGCOS)
Specifically, we leverage sample gradients as the data representations, perform clustering to group similar data, and apply an efficient greedy algorithm for coreset selection.
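The three steps listed above (gradient features, clustering, greedy per-cluster selection) can be sketched roughly as follows. This is a hedged illustration of the pipeline shape, not TAGCOS itself: the crude k-means loop, the proportional quotas, and all names are assumptions.

```python
import numpy as np

def tagcos_style_select(grads, k_clusters, budget, seed=0):
    """Cluster per-sample gradient features, then greedily pick the
    samples closest to each cluster centroid, with per-cluster quotas
    proportional to cluster size. Illustrative only."""
    rng = np.random.default_rng(seed)
    n, _ = grads.shape
    # 1. crude k-means on the gradient features
    centers = grads[rng.choice(n, k_clusters, replace=False)]
    for _ in range(10):
        dists = np.linalg.norm(grads[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k_clusters):
            members = grads[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    # 2. greedy selection: nearest-to-centroid samples per cluster
    selected = []
    for c in range(k_clusters):
        idx = np.flatnonzero(assign == c)
        quota = max(1, int(round(budget * len(idx) / n)))
        order = idx[np.argsort(np.linalg.norm(grads[idx] - centers[c], axis=1))]
        selected.extend(order[:quota].tolist())
    return sorted(selected)[:budget]
```

The per-cluster quota keeps the coreset covering every region of the gradient space rather than collapsing onto one dominant cluster.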
arXiv Detail & Related papers (2024-07-21T17:59:20Z)
- End-to-End Trainable Soft Retriever for Low-resource Relation Extraction [7.613942320502336]
This study addresses a crucial challenge in instance-based relation extraction using text generation models.
We propose a novel End-to-end TRAinable Soft K-nearest neighbor retriever (ETRASK) by the neural prompting method.
arXiv Detail & Related papers (2024-06-06T07:01:50Z)
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
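A common way to combine a k-center (coverage) objective with uncertainty sampling is a greedy farthest-point-style loop whose score blends distance-to-selected-centers with per-sample uncertainty. The sketch below illustrates that general idea under an assumed linear blend `alpha`; it is not the paper's factor 3-approximation algorithm.

```python
import numpy as np

def weighted_kcenter_select(feats, uncertainty, k, alpha=0.5):
    """Greedy subset selection blending coverage and informativeness.
    `alpha` weights distance to current centers (k-center term) against
    predictive uncertainty; both the blend and the names are assumptions."""
    selected = [int(np.argmax(uncertainty))]  # seed with most uncertain point
    min_dist = np.linalg.norm(feats - feats[selected[0]], axis=1)
    while len(selected) < k:
        # high score = far from every chosen center AND uncertain
        score = alpha * min_dist + (1 - alpha) * uncertainty
        score[selected] = -np.inf  # never re-pick a chosen sample
        nxt = int(np.argmax(score))
        selected.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(feats - feats[nxt], axis=1))
    return selected
```

With uniform uncertainty this degenerates to plain greedy k-center, spreading the chosen samples across the feature space.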
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Bandit-Driven Batch Selection for Robust Learning under Label Noise [20.202806541218944]
We introduce a novel approach for batch selection in Stochastic Gradient Descent (SGD) training, leveraging bandit algorithms.
Our methodology focuses on optimizing the learning process in the presence of label noise, a prevalent issue in real-world datasets.
arXiv Detail & Related papers (2023-10-31T19:19:01Z)
- Fake detection in imbalance dataset by Semi-supervised learning with GAN [1.4542411354617986]
Our study contributes to the field by achieving an 81% accuracy in detecting fake accounts using only 100 labeled samples.
This demonstrates the potential of SGAN as a powerful tool for handling minority classes and addressing big data challenges in fake account detection.
arXiv Detail & Related papers (2022-12-02T10:22:18Z)
- Exploiting Diversity of Unlabeled Data for Label-Efficient Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
arXiv Detail & Related papers (2022-07-25T16:11:55Z)
- UNICON: Combating Label Noise Through Uniform Selection and Contrastive Learning [89.56465237941013]
We propose UNICON, a simple yet effective sample selection method which is robust to high label noise.
We obtain an 11.4% improvement over the current state-of-the-art on CIFAR100 dataset with a 90% noise rate.
arXiv Detail & Related papers (2022-03-28T07:36:36Z)
- Truth Discovery in Sequence Labels from Crowds [12.181422057560201]
Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), have been deployed to assist in this purpose.
Existing literature in annotation aggregation assumes that annotations are independent and thus faces challenges when handling the sequential label aggregation tasks.
We propose an optimization-based method that infers the ground truth labels using annotations provided by workers for sequential labeling tasks.
arXiv Detail & Related papers (2021-09-09T19:12:13Z)
- An Empirical Survey of Data Augmentation for Limited Data Learning in NLP [88.65488361532158]
Dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks.
Data augmentation methods have been explored as a means of improving data efficiency in NLP.
We provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting.
arXiv Detail & Related papers (2021-06-14T15:27:22Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.