In Generative AI we Trust: Can Chatbots Effectively Verify Political
Information?
- URL: http://arxiv.org/abs/2312.13096v1
- Date: Wed, 20 Dec 2023 15:17:03 GMT
- Authors: Elizaveta Kuznetsova, Mykola Makhortykh, Victoria Vziatysheva, Martha
Stolze, Ani Baghumyan, Aleksandra Urman
- Abstract summary: This article presents a comparative analysis of the ability of two large language model (LLM)-based chatbots, ChatGPT and Bing Chat, to detect the veracity of political information.
We use AI auditing methodology to investigate how chatbots evaluate true, false, and borderline statements on five topics: COVID-19, Russian aggression against Ukraine, the Holocaust, climate change, and LGBTQ+ related debates.
The results show high performance of ChatGPT for the baseline veracity evaluation task, with 72 percent of the cases evaluated correctly on average across languages without pre-training.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article presents a comparative analysis of the ability of two large
language model (LLM)-based chatbots, ChatGPT and Bing Chat (recently rebranded
as Microsoft Copilot), to detect the veracity of political information. We use AI
auditing methodology to investigate how chatbots evaluate true, false, and
borderline statements on five topics: COVID-19, Russian aggression against
Ukraine, the Holocaust, climate change, and LGBTQ+ related debates. We compare
how the chatbots perform in high- and low-resource languages by using prompts
in English, Russian, and Ukrainian. Furthermore, we explore the ability of
chatbots to evaluate statements according to political communication concepts
of disinformation, misinformation, and conspiracy theory, using
definition-oriented prompts. We also systematically test how such evaluations
are influenced by source bias which we model by attributing specific claims to
various political and social actors. The results show high performance of
ChatGPT for the baseline veracity evaluation task, with 72 percent of the cases
evaluated correctly on average across languages without pre-training. Bing Chat
performed worse with a 67 percent accuracy. We observe significant disparities
in how chatbots evaluate prompts in high- and low-resource languages and how
they adapt their evaluations to political communication concepts with ChatGPT
providing more nuanced outputs than Bing Chat. Finally, we find that for some
veracity detection-related tasks, the performance of chatbots varied depending
on the topic of the statement or the source to which it is attributed. These
findings highlight the potential of LLM-based chatbots for tackling different
forms of false information in online environments, but also point to
substantial variation in how that potential is realized depending on factors
such as the language of the prompt and the topic of the statement.
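The auditing workflow described in the abstract (prompting a chatbot with true, false, and borderline statements and scoring its labels against the expected veracity) can be sketched as follows. This is a minimal illustration, not the authors' actual protocol: the example statements, the `query_chatbot` stub, and the label set are all assumptions standing in for a real chatbot API and audit corpus.

```python
# Minimal sketch of an AI-auditing loop for veracity evaluation.
# query_chatbot is a stand-in for a real chatbot API call.

from dataclasses import dataclass

LABELS = {"true", "false", "borderline"}

@dataclass
class Statement:
    text: str
    topic: str
    gold: str  # expected veracity label

def query_chatbot(prompt: str) -> str:
    """Stub: a real audit would call the chatbot's API here."""
    # Toy heuristic so the sketch runs end to end.
    return "false" if "hoax" in prompt.lower() else "true"

def audit(statements):
    """Prompt the chatbot for each statement and score its labels."""
    correct = 0
    for s in statements:
        prompt = (f"Evaluate the following statement as true, false, "
                  f"or borderline: {s.text}")
        label = query_chatbot(prompt).strip().lower()
        if label not in LABELS:
            label = "borderline"  # fall back on unparseable output
        if label == s.gold:
            correct += 1
    return correct / len(statements)

statements = [
    Statement("COVID-19 vaccines underwent clinical trials.", "COVID-19", "true"),
    Statement("Climate change is a hoax.", "climate change", "false"),
]
print(f"accuracy: {audit(statements):.2f}")
```

A full audit along the paper's lines would additionally vary the prompt language and the attributed source of each claim, and repeat queries to account for non-deterministic outputs.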
Related papers
- A Linguistic Comparison between Human and ChatGPT-Generated Conversations [9.022590646680095]
The research employs Linguistic Inquiry and Word Count analysis, comparing ChatGPT-generated conversations with human conversations.
Results show greater variability and authenticity in human dialogues, but ChatGPT excels in categories such as social processes, analytical style, cognition, attentional focus, and positive emotional tone.
arXiv Detail & Related papers (2024-01-29T21:43:27Z) - Demonstrations of the Potential of AI-based Political Issue Polling [0.0]
We develop a prompt engineering methodology for eliciting human-like survey responses from ChatGPT.
We execute large scale experiments, querying for thousands of simulated responses at a cost far lower than human surveys.
We find ChatGPT is effective at anticipating both the mean level and distribution of public opinion on a variety of policy issues.
But it is less successful at anticipating demographic-level differences.
arXiv Detail & Related papers (2023-07-10T12:17:15Z) - Adding guardrails to advanced chatbots [5.203329540700177]
The launch of ChatGPT in November 2022 ushered in a new era of AI.
There are already concerns that humans may be replaced by chatbots for a variety of jobs.
Chatbots can also exhibit biases that may cause significant harm and/or inequity toward different subpopulations.
arXiv Detail & Related papers (2023-06-13T02:23:04Z) - ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large
Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as one of the most important breakthroughs in natural language processing (NLP).
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Our extensive experimental results demonstrate that ChatGPT performs worse than previous models across different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on
Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z) - A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z) - Addressing Inquiries about History: An Efficient and Practical Framework
for Evaluating Open-domain Chatbot Consistency [28.255324166852535]
We propose the Addressing Inquiries about History (AIH) framework for the consistency evaluation.
At the conversation stage, AIH poses appropriate inquiries about the dialogue history to induce the chatbot to restate historical facts or opinions.
At the contradiction recognition stage, either human judges or a natural language inference (NLI) model can be employed to recognize whether the answers to the inquiries contradict the history.
arXiv Detail & Related papers (2021-06-04T03:04:13Z) - Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn
Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework includes a guiding robot and an interlocutor model that plays the role of a human.
We examined our framework in three experimental setups and evaluated the guiding robot with four different metrics, demonstrating its flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z) - FinChat: Corpus and evaluation setup for Finnish chat conversations on
everyday topics [15.94497202872835]
We describe our collection efforts to create the Finnish chat conversation corpus FinChat, made available publicly.
FinChat includes unscripted conversations on seven topics from people of different ages.
In a human evaluation, responses to questions from the evaluation set generated by the chatbots are predominantly marked as incoherent.
arXiv Detail & Related papers (2020-08-19T07:58:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.