News Verifiers Showdown: A Comparative Performance Evaluation of ChatGPT 3.5, ChatGPT 4.0, Bing AI, and Bard in News Fact-Checking
- URL: http://arxiv.org/abs/2306.17176v1
- Date: Sun, 18 Jun 2023 04:30:29 GMT
- Title: News Verifiers Showdown: A Comparative Performance Evaluation of ChatGPT 3.5, ChatGPT 4.0, Bing AI, and Bard in News Fact-Checking
- Authors: Kevin Matthe Caramancion
- Abstract summary: OpenAI's ChatGPT 3.5 and 4.0, Google's Bard (LaMDA), and Microsoft's Bing AI were evaluated.
The results showed a moderate proficiency across all models, with an average score of 65.25 out of 100.
OpenAI's GPT-4.0 stood out with a score of 71, suggesting an edge in newer LLMs' abilities to differentiate fact from deception.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study aimed to evaluate the proficiency of prominent Large Language
Models (LLMs), namely OpenAI's ChatGPT 3.5 and 4.0, Google's Bard (LaMDA), and
Microsoft's Bing AI in discerning the truthfulness of news items using black
box testing. A total of 100 fact-checked news items, all sourced from
independent fact-checking agencies, were presented to each of these LLMs under
controlled conditions. Their responses were classified into one of three
categories: True, False, and Partially True/False. The effectiveness of the
LLMs was gauged based on the accuracy of their classifications against the
verified facts provided by the independent agencies. The results showed a
moderate proficiency across all models, with an average score of 65.25 out of
100. Among the models, OpenAI's GPT-4.0 stood out with a score of 71,
suggesting an edge in newer LLMs' abilities to differentiate fact from
deception. However, when juxtaposed against the performance of human
fact-checkers, the AI models, despite showing promise, lag in comprehending the
subtleties and contexts inherent in news information. The findings highlight
the potential of AI in the domain of fact-checking while underscoring the
continued importance of human cognitive skills and the necessity for persistent
advancements in AI capabilities. Finally, the experimental data produced from
the simulations in this work are openly available on Kaggle.
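Concretely, the protocol above reduces to a small scoring harness. The following is a minimal sketch, not the study's actual tooling: query_model() is a hypothetical stand-in for each chatbot's interface (the study queried the models under controlled conditions, not through a single API), and the keyword matching in classify_reply() is only one plausible way to map free-form replies onto the three categories.

```python
# Minimal sketch of the black-box protocol described above. query_model()
# is a hypothetical stand-in for each chatbot's interface, and the keyword
# matching in classify_reply() is only one plausible reply-to-label mapping.

LABELS = ("True", "False", "Partially True/False")

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around ChatGPT 3.5/4.0, Bard, or Bing AI."""
    raise NotImplementedError("Wire up the actual chatbot interface here.")

def classify_reply(reply: str) -> str:
    """Map a free-form chatbot reply onto the study's three categories."""
    text = reply.lower()
    if "partially" in text or "partly" in text:
        return "Partially True/False"
    if "false" in text:
        return "False"
    if "true" in text:
        return "True"
    return "Partially True/False"  # conservative fallback for ambiguous replies

def score_model(model_name: str, items: list[dict]) -> float:
    """Score out of 100: share of items matching the fact-checkers' verdicts."""
    hits = 0
    for item in items:  # each item: {"claim": ..., "verdict": one of LABELS}
        prompt = ("Is the following news item True, False, or Partially "
                  f"True/False?\n\n{item['claim']}")
        hits += classify_reply(query_model(model_name, prompt)) == item["verdict"]
    return 100 * hits / len(items)
```

On the study's 100 items, score_model() would return a 0-100 figure directly comparable to the 65.25 average and the 71 scored by GPT-4.0.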
Related papers
- From Data to Commonsense Reasoning: The Use of Large Language Models for Explainable AI [0.0]
We study the effectiveness of large language models (LLMs) on different question answering tasks.
We demonstrate the ability of LLMs to reason with common sense, as the models outperform humans on different datasets.
Our questionnaire revealed that 66% of participants rated GPT-3.5's explanations as either "good" or "excellent".
arXiv Detail & Related papers (2024-07-04T09:38:49Z)
- Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines [2.0330684186105805]
This study explores the efficacy of Large Language Models (LLMs) in identifying misleading versus non-misleading news headlines.
Our analysis reveals significant variance in model performance, with ChatGPT-4 demonstrating superior accuracy.
arXiv Detail & Related papers (2024-05-06T04:06:45Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Fact-checking information from large language models can decrease headline discernment [6.814801748069122]
We investigate the impact of fact-checking information generated by a popular large language model on belief in, and sharing intent of, political news headlines.
We find that this information does not significantly improve participants' ability to discern headline accuracy or share accurate news.
Our findings highlight an important source of potential harm stemming from AI applications.
arXiv Detail & Related papers (2023-08-21T15:47:37Z)
- Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation. (See the likelihood-comparison sketch after this list.)
arXiv Detail & Related papers (2023-07-13T17:14:38Z)
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model [61.88942482411035]
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs).
ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads.
Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface. (See the activation-shift sketch after this list.)
arXiv Detail & Related papers (2023-06-06T01:26:53Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [56.104476412839944]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models. (See the prompt-and-parse sketch after this list.)
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
- Can ChatGPT and Bard Generate Aligned Assessment Items? A Reliability Analysis against Human Performance [0.0]
ChatGPT and Bard are AI chatbots based on Large Language Models (LLMs).
In education, these AI technologies have been tested for applications in assessment and teaching.
arXiv Detail & Related papers (2023-04-09T04:53:15Z)
- FacTeR-Check: Semi-automated fact-checking through Semantic Similarity and Natural Language Inference [61.068947982746224]
FacTeR-Check enables retrieving fact-checked information, verifying unchecked claims, and tracking dangerous information over social media.
The architecture is validated using a new dataset called NLI19-SP that is publicly released with COVID-19 related hoaxes and tweets from Spanish social media.
Our results show state-of-the-art performance on the individual benchmarks, as well as producing useful analysis of the evolution over time of 61 different hoaxes. (See the retrieve-then-infer sketch after this list.)
arXiv Detail & Related papers (2021-10-27T15:44:54Z)
- Machine Learning Explanations to Prevent Overtrust in Fake News Detection [64.46876057393703]
This research investigates the effects of an Explainable AI assistant embedded in news review platforms for combating the propagation of fake news.
We design a news reviewing and sharing interface, create a dataset of news stories, and train four interpretable fake news detection algorithms.
For a deeper understanding of Explainable AI systems, we discuss interactions between user engagement, mental model, trust, and performance measures in the process of explaining.
arXiv Detail & Related papers (2020-07-24T05:42:29Z)
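Several of the entries above describe concrete techniques; the hedged sketches below unpack four of them. First, the FACTOR entry frames factuality as a preference for true statements over similar but incorrect variants, which can be read as a likelihood comparison. This is a minimal sketch of that scoring idea, assuming a hypothetical sequence_log_prob() helper rather than the authors' implementation.

```python
# Sketch of the FACTOR scoring idea: a model gets an item right when it
# assigns the true completion a higher likelihood than every similar but
# factually incorrect variant. sequence_log_prob() is a hypothetical helper.

def sequence_log_prob(prefix: str, completion: str) -> float:
    """Sum of the LM's token log-probabilities for `completion` given `prefix`."""
    raise NotImplementedError("Compute with the language model under test.")

def item_correct(prefix: str, true_text: str, false_variants: list[str]) -> bool:
    true_lp = sequence_log_prob(prefix, true_text)
    return all(true_lp > sequence_log_prob(prefix, f) for f in false_variants)

def factor_score(items: list[dict]) -> float:
    """Fraction of corpus-derived items where the true completion wins."""
    wins = sum(item_correct(i["prefix"], i["true"], i["false"]) for i in items)
    return wins / len(items)
```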
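Next, the ITI entry describes shifting activations along "truthful" directions at a limited number of attention heads during inference. The toy sketch below shows only the shift itself; the head selection and directions are stand-in values for the probe-derived quantities learned in the paper.

```python
import numpy as np

# Toy sketch of the ITI shift: at inference time, a few selected attention
# heads have their activations nudged along a learned "truthful" direction.
# Which heads, and which directions, come from probes in the paper; here
# both are stand-in values.

def iti_shift(head_activations: np.ndarray,       # shape: (num_heads, head_dim)
              directions: dict[int, np.ndarray],  # head index -> direction vector
              alpha: float = 1.0) -> np.ndarray:
    shifted = head_activations.copy()
    for head, direction in directions.items():
        unit = direction / np.linalg.norm(direction)
        shifted[head] += alpha * unit  # push this head toward "truthful"
    return shifted

# Example with random stand-in values:
acts = np.random.randn(8, 64)
dirs = {2: np.random.randn(64), 5: np.random.randn(64)}
new_acts = iti_shift(acts, dirs, alpha=5.0)
```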
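The re-ranking entry treats an LLM as a zero-shot ranking agent. One plausible prompt-and-parse reading follows, with a hypothetical query_model() call standing in for the LLM; the paper's actual prompting strategies, and its distilled specialized models, are more elaborate.

```python
import re

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM."""
    raise NotImplementedError("Wire up the LLM here.")

def rerank(query: str, passages: list[str]) -> list[str]:
    """Ask the LLM for a relevance ordering and parse it as a permutation."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    prompt = (f"Query: {query}\n\nPassages:\n{numbered}\n\n"
              "List the passage numbers in order of relevance, best first.")
    reply = query_model(prompt)
    seen, order = set(), []
    for tok in re.findall(r"\d+", reply):
        i = int(tok)
        if i < len(passages) and i not in seen:
            seen.add(i)
            order.append(i)
    order += [i for i in range(len(passages)) if i not in seen]  # keep omissions
    return [passages[i] for i in order]
```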
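Finally, the FacTeR-Check entry combines semantic similarity with natural language inference, which suggests a retrieve-then-infer pipeline. In the sketch below, embed() and nli_label() are hypothetical stand-ins for a sentence-embedding model and an NLI classifier; the NLI19-SP dataset and the hoax-tracking components are omitted.

```python
import numpy as np

# Sketch of a FacTeR-Check-style claim verifier: find the most similar
# fact-checked claim by embedding cosine similarity, then let an NLI model
# decide whether that fact-check supports or refutes the incoming claim.

def embed(text: str) -> np.ndarray:
    raise NotImplementedError("Sentence-embedding model goes here.")

def nli_label(premise: str, hypothesis: str) -> str:
    raise NotImplementedError('Return "entailment", "contradiction", or "neutral".')

def verify(claim: str, fact_checks: list[dict]) -> str:
    """fact_checks: [{"claim": ..., "verdict_text": ...}, ...]"""
    v = embed(claim)
    v = v / np.linalg.norm(v)

    def sim(fc: dict) -> float:
        u = embed(fc["claim"])
        return float(v @ (u / np.linalg.norm(u)))

    best = max(fact_checks, key=sim)          # nearest fact-checked claim
    relation = nli_label(best["verdict_text"], claim)
    return {"entailment": "supported",
            "contradiction": "refuted"}.get(relation, "unverified")
```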