Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version)
- URL: http://arxiv.org/abs/2407.06422v1
- Date: Mon, 8 Jul 2024 22:04:30 GMT
- Title: Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version)
- Authors: Yiming Zhu, Peixian Zhang, Ehsan-Ul Haq, Pan Hui, Gareth Tyson
- Abstract summary: We investigate the extent to which ChatGPT can annotate data for social computing tasks.
ChatGPT exhibits promise in handling data annotation tasks, albeit with some challenges.
We propose GPT-Rater, a tool to predict if ChatGPT can correctly label data for a given annotation task.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Harnessing the potential of large language models (LLMs) like ChatGPT can help address social challenges through inclusive, ethical, and sustainable means. In this paper, we investigate the extent to which ChatGPT can annotate data for social computing tasks, aiming to reduce the complexity and cost of undertaking web research. To evaluate ChatGPT's potential, we re-annotate seven datasets using ChatGPT, covering topics related to pressing social issues like COVID-19 misinformation, social bot deception, cyberbullying, clickbait news, and the Russo-Ukrainian War. Our findings demonstrate that ChatGPT exhibits promise in handling these data annotation tasks, albeit with some challenges. Across the seven datasets, ChatGPT achieves an average annotation F1-score of 72.00%. Its performance excels in clickbait news annotation, correctly labeling 89.66% of the data. However, we also observe significant variations in performance across individual labels. Our study reveals predictable patterns in ChatGPT's annotation performance. Thus, we propose GPT-Rater, a tool to predict if ChatGPT can correctly label data for a given annotation task. Researchers can use this to identify where ChatGPT might be suitable for their annotation requirements. We show that GPT-Rater effectively predicts ChatGPT's performance. It performs best on a clickbait headlines dataset, achieving an average F1-score of 95.00%. We believe that this research opens new avenues for analysis and can reduce barriers to engaging in social computing research.
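The abstract's headline numbers (e.g. a 72.00% average annotation F1-score) come from scoring ChatGPT's labels against the human annotations treated as ground truth. A minimal sketch of that kind of per-label and average F1 computation, using illustrative labels rather than the paper's actual datasets:

```python
# Sketch of scoring LLM annotations against human "ground truth" labels,
# as in per-label and average F1 evaluation. The label names and data
# below are hypothetical examples, not taken from the paper's datasets.

def per_label_f1(gold, pred):
    """Return a dict of F1 per label and the macro-averaged F1."""
    labels = sorted(set(gold) | set(pred))
    scores = {}
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[lab] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    macro = sum(scores.values()) / len(scores)
    return scores, macro

# Hypothetical human vs. LLM labels for a clickbait-style task
human = ["clickbait", "news", "news", "clickbait", "news"]
model = ["clickbait", "news", "clickbait", "clickbait", "news"]
scores, macro = per_label_f1(human, model)  # macro F1 = 0.8 here
```

Reporting per-label F1 alongside the average is what exposes the "significant variations in performance across individual labels" that the abstract notes.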
Related papers
- Is ChatGPT the Future of Causal Text Mining? A Comprehensive Evaluation and Analysis
This study conducts comprehensive evaluations of ChatGPT's causal text mining capabilities.
We introduce a benchmark that extends beyond general English datasets.
We also provide an evaluation framework to ensure fair comparisons between ChatGPT and previous approaches.
arXiv Detail & Related papers (2024-02-22T12:19:04Z)
- Exploring ChatGPT's Capabilities on Vulnerability Management
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z)
- Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness
We focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks.
ChatGPT's performance in Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting.
ChatGPT provides high-quality and trustworthy explanations for its decisions.
arXiv Detail & Related papers (2023-04-23T12:33:18Z)
- Can ChatGPT Reproduce Human-Generated Labels? A Study of Social Computing Tasks
ChatGPT has the potential to reproduce human-generated label annotations in social computing tasks.
We relabel five datasets covering stance detection (two datasets), sentiment analysis, hate speech, and bot detection.
Our results highlight that ChatGPT does have the potential to handle these data annotation tasks, although a number of challenges remain.
arXiv Detail & Related papers (2023-04-20T08:08:12Z)
- To ChatGPT, or not to ChatGPT: That is the question!
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Is ChatGPT A Good Keyphrase Generator? A Preliminary Study
ChatGPT has recently garnered significant attention from the computational linguistics community.
We evaluate its performance in various aspects, including keyphrase generation prompts, keyphrase generation diversity, and long document understanding.
We find that ChatGPT performs exceptionally well on all six candidate prompts, with minor performance differences observed across the datasets.
arXiv Detail & Related papers (2023-03-23T02:50:38Z)
- Is ChatGPT a Good NLG Evaluator? A Preliminary Study
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study prompts the development of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)