Are Chatbots Reliable Text Annotators? Sometimes
- URL: http://arxiv.org/abs/2311.05769v2
- Date: Tue, 25 Feb 2025 09:57:48 GMT
- Title: Are Chatbots Reliable Text Annotators? Sometimes
- Authors: Ross Deans Kristensen-McLachlan, Miceal Canavan, Márton Kardos, Mia Jacobsen, Lene Aarøe,
- Abstract summary: ChatGPT is a closed-source product which has major drawbacks with regards to transparency, cost, and data protection.<n>Recent advances in open-source (OS) large language models (LLMs) offer an alternative without these drawbacks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent research highlights the significant potential of ChatGPT for text annotation in social science research. However, ChatGPT is a closed-source product which has major drawbacks with regards to transparency, reproducibility, cost, and data protection. Recent advances in open-source (OS) large language models (LLMs) offer an alternative without these drawbacks. Thus, it is important to evaluate the performance of OS LLMs relative to ChatGPT and standard approaches to supervised machine learning classification. We conduct a systematic comparative evaluation of the performance of a range of OS LLMs alongside ChatGPT, using both zero- and few-shot learning as well as generic and custom prompts, with results compared to supervised classification models. Using a new dataset of tweets from US news media, and focusing on simple binary text annotation tasks, we find significant variation in the performance of ChatGPT and OS models across the tasks, and that the supervised classifier using DistilBERT generally outperforms both. Given the unreliable performance of ChatGPT and the significant challenges it poses to Open Science we advise caution when using ChatGPT for substantive text annotation tasks.
Related papers
- Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version) [26.643834593780007]
We investigate the extent to which ChatGPT can annotate data for social computing tasks.
ChatGPT exhibits promise in handling data annotation tasks, albeit with some challenges.
We propose GPT-Rater, a tool to predict if ChatGPT can correctly label data for a given annotation task.
arXiv Detail & Related papers (2024-07-08T22:04:30Z) - Is ChatGPT the Future of Causal Text Mining? A Comprehensive Evaluation
and Analysis [8.031131164056347]
This study conducts comprehensive evaluations of ChatGPT's causal text mining capabilities.
We introduce a benchmark that extends beyond general English datasets.
We also provide an evaluation framework to ensure fair comparisons between ChatGPT and previous approaches.
arXiv Detail & Related papers (2024-02-22T12:19:04Z) - Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z) - Towards LLM-driven Dialogue State Tracking [13.679946384741008]
Large language models (LLMs) such as GPT3 and ChatGPT have sparked considerable interest in assessing their efficacy across diverse applications.
We present LDST, an LLM-driven Dialogue State Tracking framework based on smaller, open-source foundation models.
We find that LDST exhibits remarkable performance improvements in both zero-shot and few-shot setting compared to previous SOTA methods.
arXiv Detail & Related papers (2023-10-23T14:15:28Z) - A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark
Datasets [19.521390684403293]
We present a thorough evaluation of ChatGPT's performance on diverse academic datasets.
Specifically, we evaluate ChatGPT across 140 tasks and analyze 255K responses it generates in these datasets.
arXiv Detail & Related papers (2023-05-29T12:37:21Z) - Automatic Code Summarization via ChatGPT: How Far Are We? [10.692654700225411]
We evaluate ChatGPT on a widely-used Python dataset called CSN-Python.
In terms of BLEU and ROUGE-L, ChatGPT's code summarization performance is significantly worse than all three SOTA models.
Based on the findings, we outline several open challenges and opportunities in ChatGPT-based code summarization.
arXiv Detail & Related papers (2023-05-22T09:43:40Z) - ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large
Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as the most important breakthroughs in natural language processing (NLP)
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Compared to the performance of previous models, our extensive experimental results demonstrate a worse performance of ChatGPT for different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - Comparing Abstractive Summaries Generated by ChatGPT to Real Summaries
Through Blinded Reviewers and Text Classification Algorithms [0.8339831319589133]
ChatGPT, developed by OpenAI, is a recent addition to the family of language models.
We evaluate the performance of ChatGPT on Abstractive Summarization by the means of automated metrics and blinded human reviewers.
arXiv Detail & Related papers (2023-03-30T18:28:33Z) - Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z) - Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purposed reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z) - Exploring the Limits of ChatGPT for Query or Aspect-based Text
Summarization [28.104696513516117]
Large language models (LLMs) like GPT3 and ChatGPT have recently created significant interest in using these models for text summarization tasks.
Recent studies citegoyal2022news, zhang2023benchmarking have shown that LLMs-generated news summaries are already on par with humans.
Our experiments reveal that ChatGPT's performance is comparable to traditional fine-tuning methods in terms of Rouge scores.
arXiv Detail & Related papers (2023-02-16T04:41:30Z) - A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on
Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z) - Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.