Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media
- URL: http://arxiv.org/abs/2502.01658v1
- Date: Fri, 31 Jan 2025 20:35:30 GMT
- Title: Large Language Models' Accuracy in Emulating Human Experts' Evaluation of Public Sentiments about Heated Tobacco Products on Social Media
- Authors: Kwanho Kim, Soojong Kim,
- Abstract summary: Large Language Models (LLMs) can help streamline the labor-intensive human sentiment analysis process.
This study examined the accuracy of LLMs in replicating human sentiment evaluation of social media messages about heated tobacco products (HTPs)
LLMs can be used for sentiment analysis of HTP-related social media messages, with GPT-4 Turbo reaching around 80% accuracy compared to human experts.
- Score: 2.07180164747172
- License:
- Abstract: Sentiment analysis of alternative tobacco products on social media is important for tobacco control research. Large Language Models (LLMs) can help streamline the labor-intensive human sentiment analysis process. This study examined the accuracy of LLMs in replicating human sentiment evaluation of social media messages about heated tobacco products (HTPs). The research used GPT-3.5 and GPT-4 Turbo to classify 500 Facebook and 500 Twitter messages, including anti-HTPs, pro-HTPs, and neutral messages. The models evaluated each message up to 20 times, and their majority label was compared to human evaluators. Results showed that GPT-3.5 accurately replicated human sentiment 61.2% of the time for Facebook messages and 57.0% for Twitter messages. GPT-4 Turbo performed better, with 81.7% accuracy for Facebook and 77.0% for Twitter. Using three response instances, GPT-4 Turbo achieved 99% of the accuracy of twenty instances. GPT-4 Turbo also had higher accuracy for anti- and pro-HTPs messages compared to neutral ones. Misclassifications by GPT-3.5 often involved anti- or pro-HTPs messages being labeled as neutral or irrelevant, while GPT-4 Turbo showed improvements across all categories. In conclusion, LLMs can be used for sentiment analysis of HTP-related social media messages, with GPT-4 Turbo reaching around 80% accuracy compared to human experts. However, there's a risk of misrepresenting overall sentiment due to differences in accuracy across sentiment categories.
Related papers
- Classification performance and reproducibility of GPT-4 omni for information extraction from veterinary electronic health records [0.0]
This study compares the performance of GPT-4 omni (GPT-4o) and GPT-3.5 Turbo under different conditions.
Using GPT-4o to automate information extraction from veterinary EHRs is a viable alternative to manual extraction.
arXiv Detail & Related papers (2024-09-09T21:55:15Z) - If in a Crowdsourced Data Annotation Pipeline, a GPT-4 [12.898580978312848]
This paper compared GPT-4 and an ethical and well-executed MTurk pipeline.
Despite best practices, MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%.
arXiv Detail & Related papers (2024-02-26T18:08:52Z) - Behind the Screen: Investigating ChatGPT's Dark Personality Traits and
Conspiracy Beliefs [0.0]
This paper analyzes the dark personality traits and conspiracy beliefs of GPT-3.5 and GPT-4.
Dark personality traits and conspiracy beliefs were not particularly pronounced in either model.
arXiv Detail & Related papers (2024-02-06T16:03:57Z) - Is ChatGPT Involved in Texts? Measure the Polish Ratio to Detect
ChatGPT-Generated Text [48.36706154871577]
We introduce a novel dataset termed HPPT (ChatGPT-polished academic abstracts)
It diverges from extant corpora by comprising pairs of human-written and ChatGPT-polished abstracts instead of purely ChatGPT-generated texts.
We also propose the "Polish Ratio" method, an innovative measure of the degree of modification made by ChatGPT compared to the original human-written text.
arXiv Detail & Related papers (2023-07-21T06:38:37Z) - How is ChatGPT's behavior changing over time? [72.79311931941876]
We evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4.
We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
arXiv Detail & Related papers (2023-07-18T06:56:08Z) - DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - Evaluation of GPT and BERT-based models on identifying protein-protein
interactions in biomedical text [1.3923237289777164]
Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks.
We evaluated the performance of PPI identification of multiple GPT and BERT models using three manually curated gold-standard corpora.
arXiv Detail & Related papers (2023-03-30T22:06:10Z) - Humans in Humans Out: On GPT Converging Toward Common Sense in both
Success and Failure [0.0]
GPT-3, GPT-3.5, and GPT-4 were trained on large quantities of human-generated text.
We show that GPT-3 showed evidence of ETR-predicted outputs for 59% of these examples.
Remarkably, the production of human-like fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in GPT-4.
arXiv Detail & Related papers (2023-03-30T10:32:18Z) - How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language
Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance compared with BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.