A Comparative Study on Annotation Quality of Crowdsourcing and LLM via
Label Aggregation
- URL: http://arxiv.org/abs/2401.09760v1
- Date: Thu, 18 Jan 2024 07:23:51 GMT
- Authors: Jiyi Li
- Abstract summary: We investigate which existing crowdsourcing datasets can be used for a comparative study and create a benchmark.
We then compare the quality of individual crowd labels and LLM labels and evaluate the aggregated labels.
We find that adding LLM labels from good LLMs to existing crowdsourcing datasets can enhance the quality of the aggregated labels.
- Score: 6.871295804618002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whether Large Language Models (LLMs) can outperform crowdsourcing on data
annotation tasks has recently attracted interest. Some works have examined this
question by comparing the average performance of individual crowd workers and LLM
workers on specific NLP tasks using newly collected datasets. However, on the one
hand, existing datasets from studies of annotation quality in crowdsourcing have
not yet been utilized in such evaluations, although they could provide reliable
evaluations from a different viewpoint. On the other hand, the quality of the
aggregated labels is crucial because, in crowdsourcing, the labels estimated by
aggregating multiple crowd labels for the same instances are the labels that are
eventually collected. Therefore, in this paper, we first investigate which
existing crowdsourcing datasets can be used for a comparative study and create a
benchmark. We then compare the quality of individual crowd labels and LLM labels
and evaluate the aggregated labels. In addition, we propose a Crowd-LLM hybrid
label aggregation method and verify its performance. We find that adding labels
from good LLMs to existing crowdsourcing datasets enhances the quality of the
aggregated labels, which also exceeds the quality of the LLM labels themselves.
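The core idea of adding LLM labels as extra votes can be sketched with simple majority voting. This is a minimal illustration, not the paper's actual aggregation method (the paper evaluates established label aggregation models, which are typically more sophisticated than majority vote); the function name and inputs are illustrative.

```python
from collections import Counter

def aggregate_labels(crowd_labels, llm_label=None):
    """Majority-vote aggregation over crowd labels for one instance.

    Optionally appends an LLM label as one extra vote, a simple
    stand-in for Crowd-LLM hybrid aggregation: the LLM is treated
    as just another worker whose label joins the vote pool.
    """
    votes = list(crowd_labels)
    if llm_label is not None:
        votes.append(llm_label)
    # most_common(1) returns [(label, count)] with the highest count first
    return Counter(votes).most_common(1)[0][0]

# Two crowd workers disagree; the LLM vote tips the outcome.
print(aggregate_labels(["A", "B"], llm_label="B"))  # B
```

In practice, whether the extra vote helps depends on the LLM's accuracy relative to the crowd workers, which is exactly what the paper's benchmark measures.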
Related papers
- Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance [21.926934384262594]
Large language models (LLMs) offer new opportunities to enhance the annotation process.
We compare expert, crowd-sourced, and our LLM-based annotations in terms of agreement, label quality, and efficiency.
Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance.
arXiv Detail & Related papers (2024-10-24T16:27:03Z)
- Learning to Predict Usage Options of Product Reviews with LLM-Generated Labels [14.006486214852444]
We propose a method of using LLMs as few-shot learners for annotating data in a complex natural language task.
Learning a custom model offers individual control over energy efficiency and privacy measures.
We find that the quality of the resulting data exceeds the level attained by third-party vendor services.
arXiv Detail & Related papers (2024-10-16T11:34:33Z)
- Zero-to-Strong Generalization: Eliciting Strong Capabilities of Large Language Models Iteratively without Gold Labels [75.77877889764073]
Large Language Models (LLMs) have demonstrated remarkable performance through supervised fine-tuning or in-context learning using gold labels.
This study explores whether solely utilizing unlabeled data can elicit strong model capabilities.
We propose a new paradigm termed zero-to-strong generalization.
arXiv Detail & Related papers (2024-09-19T02:59:44Z)
- Exploiting Conjugate Label Information for Multi-Instance Partial-Label Learning [61.00359941983515]
Multi-instance partial-label learning (MIPL) addresses scenarios where each training sample is represented as a multi-instance bag associated with a candidate label set containing one true label and several false positives.
ELIMIPL exploits the conjugate label information to improve the disambiguation performance.
arXiv Detail & Related papers (2024-08-26T15:49:31Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
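The IFD metric described above is, roughly, the ratio of the model's loss on a response when conditioned on its instruction to the loss on the response alone: values near 1 mean the instruction barely helps the model predict the response. The sketch below assumes per-token cross-entropy losses have already been computed by some language model; the function name and inputs are illustrative, not the authors' implementation.

```python
def instruction_following_difficulty(cond_token_losses, direct_token_losses):
    """IFD score for one (instruction, response) pair.

    cond_token_losses:   per-token losses on the response, conditioned
                         on the instruction.
    direct_token_losses: per-token losses on the response alone.
    A high ratio suggests the instruction provides little help in
    generating the response (a "difficult" sample to follow).
    """
    cond = sum(cond_token_losses) / len(cond_token_losses)
    direct = sum(direct_token_losses) / len(direct_token_losses)
    return cond / direct

# The instruction halves the average loss, so IFD is 0.5.
print(instruction_following_difficulty([1.0, 1.0], [2.0, 2.0]))  # 0.5
```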
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
- Binary Classification with Positive Labeling Sources [71.37692084951355]
We propose WEAPO, a simple yet competitive WS method for producing training labels without negative labeling sources.
We show WEAPO achieves the highest averaged performance on 10 benchmark datasets.
arXiv Detail & Related papers (2022-08-02T19:32:08Z)
- Eliciting and Learning with Soft Labels from Every Annotator [31.10635260890126]
We focus on efficiently eliciting soft labels from individual annotators.
We demonstrate that learning with our labels achieves comparable model performance to prior approaches.
arXiv Detail & Related papers (2022-07-02T12:03:00Z)
- An Empirical Investigation of Learning from Biased Toxicity Labels [15.822714574671412]
We study how different training strategies can leverage a small dataset of human-annotated labels and a large but noisy dataset of synthetically generated labels.
We evaluate the accuracy and fairness properties of these approaches, and trade-offs between the two.
arXiv Detail & Related papers (2021-10-04T17:19:57Z)
- Bayesian Semi-supervised Crowdsourcing [71.20185379303479]
Crowdsourcing has emerged as a powerful paradigm for efficiently labeling large datasets and performing various learning tasks.
This work deals with semi-supervised crowdsourced classification, under two regimes of semi-supervision.
arXiv Detail & Related papers (2020-12-20T23:18:51Z)
- An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels [49.036212158261215]
Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications.
Current state-of-the-art LMTC models employ Label-Wise Attention Networks (LWANs).
We show that hierarchical methods based on Probabilistic Label Trees (PLTs) outperform LWANs.
We propose a new state-of-the-art method which combines BERT with LWANs.
arXiv Detail & Related papers (2020-10-04T18:55:47Z)
- Generalized Label Enhancement with Sample Correlations [24.582764493585362]
We propose two novel label enhancement methods, i.e., Label Enhancement with Sample Correlations (LESC) and generalized Label Enhancement with Sample Correlations (gLESC).
Benefiting from sample correlations, the proposed methods can boost the performance of label enhancement.
arXiv Detail & Related papers (2020-04-07T03:32:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.