If in a Crowdsourced Data Annotation Pipeline, a GPT-4
- URL: http://arxiv.org/abs/2402.16795v2
- Date: Fri, 28 Jun 2024 19:33:48 GMT
- Title: If in a Crowdsourced Data Annotation Pipeline, a GPT-4
- Authors: Zeyu He, Chieh-Yang Huang, Chien-Kuang Cornelia Ding, Shaurya Rohatgi, Ting-Hao 'Kenneth' Huang
- Abstract summary: This paper compared GPT-4 with an ethical and well-executed MTurk pipeline.
Despite best practices, the MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%.
- Score: 12.898580978312848
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies indicated that GPT-4 outperforms online crowd workers in data labeling accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies were criticized for deviating from standard crowdsourcing practices and for emphasizing individual workers' performances over the whole data-annotation process. This paper compared GPT-4 with an ethical and well-executed MTurk pipeline, in which 415 workers labeled 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces yielded 127,080 labels, which were then used to infer the final labels through eight label-aggregation algorithms. Our evaluation showed that, despite following best practices, the MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%. Interestingly, when combining GPT-4's labels with crowd labels collected via an advanced worker interface for aggregation, 2 out of the 8 algorithms achieved an even higher accuracy (87.5% and 87.0%). Further analysis suggested that, when the crowd's and GPT-4's labeling strengths are complementary, aggregating them could increase labeling accuracy.
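The abstract's key idea, treating GPT-4 as one more annotator and aggregating its labels with crowd labels, can be illustrated with majority voting, the simplest label-aggregation scheme. This is an illustrative sketch only: the paper evaluates eight aggregation algorithms, and the CODA-19 label names used here are stand-ins, not the authors' implementation.

```python
from collections import Counter

def aggregate_labels(label_sets):
    """Majority vote per item; ties broken by first-seen label."""
    return [Counter(labels).most_common(1)[0][0] for labels in label_sets]

# Per-segment crowd labels (one inner list per sentence segment).
crowd = [
    ["background", "purpose", "background"],
    ["finding", "finding", "method"],
]
# GPT-4's label for each segment, treated as one more annotator.
gpt4 = ["background", "finding"]

combined = [c + [g] for c, g in zip(crowd, gpt4)]
print(aggregate_labels(combined))  # -> ['background', 'finding']
```

When the crowd is split, the GPT-4 "vote" can tip the aggregate, which is one way complementary strengths raise accuracy, as the abstract suggests.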
Related papers
- Towards Scalable Automated Grading: Leveraging Large Language Models for Conceptual Question Evaluation in Engineering [5.160473221022088]
This study explores the feasibility of using large language models (LLMs) for automated grading of conceptual questions.
We compared the grading performance of GPT-4o with that of human teaching assistants (TAs) on ten quiz problems from the MEEN 361 course at Texas A&M University.
Our analysis reveals that GPT-4o performs well when grading criteria are straightforward but struggles with nuanced answers.
arXiv Detail & Related papers (2024-11-06T04:41:13Z) - Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models [2.186740861187042]
Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets.
This paper investigates the potential of large language models (LLMs) to improve adherence to metadata standards.
We conducted experiments on 200 random data records describing human samples relating to lung cancer from the NCBI BioSample repository.
arXiv Detail & Related papers (2024-04-08T22:29:53Z) - A comparison of Human, GPT-3.5, and GPT-4 Performance in a University-Level Coding Course [0.0]
This study evaluates the performance of the ChatGPT variants GPT-3.5 and GPT-4 against work authored solely by students, as well as a mixed category containing both student and GPT-4 contributions, in university-level physics coding assignments written in Python.
Students averaged 91.9% (SE: 0.4), surpassing the highest-performing AI submission category, GPT-4 with prompt engineering, which scored 81.1% (SE: 0.8), a statistically significant difference (p = $2.482 \times 10^{-10}$).
The blinded markers were tasked with guessing the authorship of the submissions on a four-point Likert scale from Definitely
arXiv Detail & Related papers (2024-03-25T17:41:02Z) - GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z) - Automated title and abstract screening for scoping reviews using the GPT-4 Large Language Model [0.0]
GPTscreenR is a package for the R statistical programming language that uses the GPT-4 Large Language Model (LLM) to automatically screen sources.
In validation against consensus human reviewer decisions, GPTscreenR performed similarly to an alternative zero-shot technique, with a sensitivity of 71%, specificity of 89%, and overall accuracy of 84%.
arXiv Detail & Related papers (2023-11-14T05:30:43Z) - Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z) - Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z) - Progressive-Hint Prompting Improves Reasoning in Large Language Models [63.98629132836499]
This paper proposes a new prompting method named Progressive-Hint Prompting (PHP).
It enables automatic multiple interactions between users and Large Language Models (LLMs) by using previously generated answers as hints to progressively guide toward the correct answers.
We conducted extensive and comprehensive experiments on seven benchmarks. The results show that PHP significantly improves accuracy while remaining highly efficient.
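The PHP loop described above, re-prompting a model with its own previous answers as hints until the answer stabilizes, can be sketched as follows. This is a minimal illustration of the idea, not the authors' prompt template; `ask_llm` is a hypothetical callable standing in for a real model API.

```python
def build_php_prompt(question, prior_answers):
    """Append previously generated answers as a hint (sketch of the PHP idea)."""
    if not prior_answers:
        return question
    return f"{question} (Hint: the answer is near {', '.join(prior_answers)}.)"

def solve_with_php(question, ask_llm, max_rounds=4):
    """Re-query until two consecutive answers agree, or the round budget runs out."""
    answers = []
    for _ in range(max_rounds):
        answer = ask_llm(build_php_prompt(question, answers))
        if answers and answer == answers[-1]:  # answer has stabilized
            return answer
        answers.append(answer)
    return answers[-1]

# Usage with a stub model that always answers "7":
print(solve_with_php("What is 3 + 4?", lambda prompt: "7"))  # -> 7
```

The stopping rule (two consecutive matching answers) is one simple convergence criterion; the interactions are automatic, with no user in the loop between rounds.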
arXiv Detail & Related papers (2023-04-19T16:29:48Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - Learning to Purify Noisy Labels via Meta Soft Label Corrector [49.92310583232323]
Recent deep neural networks (DNNs) can easily overfit to biased training data with noisy labels.
Label correction strategy is commonly used to alleviate this issue.
We propose a meta-learning model which could estimate soft labels through meta-gradient descent step.
arXiv Detail & Related papers (2020-08-03T03:25:17Z) - Semi-Supervised Learning with Data Augmentation for End-to-End ASR [4.878819328459915]
We focus on the consistency regularization principle, which has been successfully applied to image classification tasks.
We present sequence-to-sequence (seq2seq) versions of the FixMatch and Noisy Student algorithms.
arXiv Detail & Related papers (2020-07-27T21:24:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.