A Comparison of Human and ChatGPT Classification Performance on Complex Social Media Data
- URL: http://arxiv.org/abs/2512.00673v1
- Date: Sat, 29 Nov 2025 23:59:58 GMT
- Title: A Comparison of Human and ChatGPT Classification Performance on Complex Social Media Data
- Authors: Breanna E. Green, Ashley L. Shea, Pengfei Zhao, Drew B. Margolin
- Abstract summary: We measure the performance of GPT-4 on one task and compare results to human annotators. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.
- Score: 7.492722530877262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative artificial intelligence tools, like ChatGPT, are an increasingly utilized resource among computational social scientists. Nevertheless, there remains space for improved understanding of the performance of ChatGPT in complex tasks such as classifying and annotating datasets containing nuanced language. In this paper, we measure the performance of GPT-4 on one such task and compare results to human annotators. We investigate ChatGPT versions 3.5, 4, and 4o to examine performance given rapid changes in the technological advancement of large language models. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Both quantitative and qualitative evaluations of results demonstrate that while including label definitions in prompts may help performance, overall GPT-4 has difficulty classifying nuanced language. Qualitative analysis reveals four specific findings. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.
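The evaluation the abstract describes, scoring model labels against human annotations with precision, recall, and F1, can be sketched as below. This is a minimal illustration: the label names and example data are invented, not drawn from the paper's dataset.

```python
# Minimal sketch of comparing model labels to human annotations with
# precision, recall, and F1 for a chosen positive class.
# Labels and data are illustrative, not from the paper.

def precision_recall_f1(gold, predicted, positive="nuanced"):
    """Binary precision/recall/F1 treating `positive` as the target class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

human_labels = ["nuanced", "plain", "nuanced", "plain", "nuanced"]
gpt_labels   = ["nuanced", "nuanced", "plain", "plain", "nuanced"]
p, r, f = precision_recall_f1(human_labels, gpt_labels)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In practice such scores would be computed per label class and per prompt style, as the paper's four-prompt comparison implies.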
Related papers
- Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond [3.615835506868351]
We focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty.
arXiv Detail & Related papers (2026-01-29T14:16:51Z) - On Fusing ChatGPT and Ensemble Learning in Discontinuous Named Entity Recognition in Health Corpora [0.0]
We investigate the integration of ChatGPT as an arbitrator within an ensemble method, aiming to enhance performance on DNER tasks. Our method combines five state-of-the-art NER models with ChatGPT using custom prompt engineering to assess the robustness and generalization capabilities of the ensemble algorithm. The results indicate that our proposed fusion of ChatGPT with the ensemble learning algorithm outperforms the SOTA results in the CADEC, ShARe13, and ShARe14 datasets.
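The arbitration idea this entry describes, several base models vote and an LLM breaks disagreements, might look roughly as follows. The `ask_arbitrator` callable is a hypothetical stand-in for a ChatGPT prompt; it is not the paper's actual API or prompt design.

```python
# Sketch of an ensemble with an external arbitrator: majority vote over
# base-model predictions; on a tie, defer to an arbitrator (e.g. an LLM).
from collections import Counter

def ensemble_label(predictions, ask_arbitrator):
    """Return the majority label; on a tie, let `ask_arbitrator`
    (a hypothetical callable, e.g. wrapping a ChatGPT query)
    choose among the tied candidates."""
    counts = Counter(predictions).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]  # clear majority among base models
    tied = [label for label, c in counts if c == counts[0][1]]
    return ask_arbitrator(tied)  # arbitrator resolves the tie

# Illustrative use with a trivial arbitrator that picks the first candidate.
label = ensemble_label(["DRUG", "DRUG", "ADE", "ADE", "O"],
                       ask_arbitrator=lambda tied: tied[0])
print(label)
```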
arXiv Detail & Related papers (2024-12-22T11:26:49Z) - GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP [21.6253870440136]
This study conducts a large-scale automated and human evaluation of ChatGPT, encompassing 44 distinct language understanding and generation tasks.
Our findings indicate that, despite its remarkable performance in English, ChatGPT is consistently surpassed by smaller models that have undergone finetuning on Arabic.
arXiv Detail & Related papers (2023-05-24T10:12:39Z) - ChatGraph: Interpretable Text Classification by Converting ChatGPT Knowledge to Graphs [54.48467003509595]
ChatGPT has shown superior performance in various natural language processing (NLP) tasks.
We propose a novel framework that leverages the power of ChatGPT for specific tasks, such as text classification.
Our method provides a more transparent decision-making process compared with previous text classification methods.
arXiv Detail & Related papers (2023-05-03T19:57:43Z) - Empirical Evaluation of ChatGPT on Requirements Information Retrieval Under Zero-Shot Setting [12.733403458944972]
We empirically evaluate ChatGPT's performance on requirements information retrieval tasks.
Under zero-shot setting, evaluation results reveal ChatGPT's promising ability to retrieve requirements relevant information.
arXiv Detail & Related papers (2023-04-25T04:09:45Z) - Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z) - ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as among the most important breakthroughs in natural language processing (NLP).
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Compared to the performance of previous models, our extensive experimental results demonstrate a worse performance of ChatGPT for different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - Exploring ChatGPT's Ability to Rank Content: A Preliminary Study on Consistency with Human Preferences [6.821378903525802]
ChatGPT has consistently demonstrated a remarkable level of accuracy and reliability in terms of content evaluation.
A test set consisting of prompts is created, covering a wide range of use cases, and five models are utilized to generate corresponding responses.
Results on the test set show that ChatGPT's ranking preferences are consistent with humans to a certain extent.
arXiv Detail & Related papers (2023-03-14T03:13:02Z) - Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purpose, reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z) - ChatGPT: Jack of all trades, master of none [4.693597927153063]
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT).
We examined ChatGPT's capabilities on 25 diverse analytical NLP tasks.
We automated the ChatGPT and GPT-4 prompting process and analyzed more than 49k responses.
arXiv Detail & Related papers (2023-02-21T15:20:37Z) - Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.