GenAI vs. Human Fact-Checkers: Accurate Ratings, Flawed Rationales
- URL: http://arxiv.org/abs/2502.14943v3
- Date: Tue, 25 Feb 2025 19:06:04 GMT
- Title: GenAI vs. Human Fact-Checkers: Accurate Ratings, Flawed Rationales
- Authors: Yuehong Cassandra Tai, Khushi Navin Patni, Nicholas Daniel Hemauer, Bruce Desmarais, Yu-Ru Lin
- Abstract summary: GPT-4o, one of the most used AI models in consumer applications, outperforms other models, but all models exhibit only moderate agreement with human coders. We also assess the effectiveness of summarized versus full content inputs, finding that summarized content holds promise for improving efficiency without sacrificing accuracy.
- Score: 2.3475022003300055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in understanding the capabilities and limits of generative artificial intelligence (GenAI) models, we are just beginning to understand their capacity to assess and reason about the veracity of content. We evaluate multiple GenAI models across tasks that involve the rating of, and perceived reasoning about, the credibility of information. The information in our experiments comes from content that subnational U.S. politicians post to Facebook. We find that GPT-4o, one of the most used AI models in consumer applications, outperforms other models, but all models exhibit only moderate agreement with human coders. Importantly, even when GenAI models accurately identify low-credibility content, their reasoning relies heavily on linguistic features and "hard" criteria, such as the level of detail, source reliability, and language formality, rather than an understanding of veracity. We also assess the effectiveness of summarized versus full content inputs, finding that summarized content holds promise for improving efficiency without sacrificing accuracy. While GenAI has the potential to support human fact-checkers in scaling misinformation detection, our results caution against relying solely on these models.
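As a concrete illustration of the model-human agreement the abstract reports, the sketch below computes a weighted Cohen's kappa between hypothetical human and GPT-4o credibility ratings. The metric choice and the data are assumptions for exposition, not a reproduction of the paper's published analysis.

```python
# Minimal sketch: quantifying model-human agreement on credibility ratings.
# The labels and the choice of Cohen's kappa are illustrative assumptions;
# the paper's exact metric and data are not reproduced here.
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal credibility ratings (1 = low, 3 = high) for ten posts.
human_ratings = [1, 2, 3, 3, 1, 2, 2, 3, 1, 2]
gpt4o_ratings = [1, 2, 3, 2, 1, 3, 2, 3, 2, 2]

# Quadratic weighting penalizes large disagreements on the ordinal scale more
# than adjacent ones; values around 0.4-0.6 are conventionally read as
# "moderate" agreement.
kappa = cohen_kappa_score(human_ratings, gpt4o_ratings, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```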
Related papers
- Human Trust in AI Search: A Large-Scale Experiment [0.07589017023705934]
Generative artificial intelligence (GenAI) can influence what we buy, how we vote, and our health.
No work establishes the causal effect of generative search designs on human trust.
We execute 12,000 search queries across seven countries, generating 80,000 real-time GenAI and traditional search results.
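A minimal sketch of how a randomized experiment like this supports a causal estimate, assuming hypothetical trust ratings and condition names rather than the study's actual designs and measures:

```python
# Minimal sketch: estimating the causal effect of a search design on trust
# from a randomized experiment. Condition names and trust scores are
# hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-7 trust ratings under two randomly assigned designs.
trust_genai = rng.integers(1, 8, size=500)        # GenAI answer-style results
trust_traditional = rng.integers(1, 8, size=500)  # traditional link-style results

# With random assignment, the difference in mean trust is an unbiased
# estimate of the average treatment effect of the design.
ate = trust_genai.mean() - trust_traditional.mean()

# Nonparametric bootstrap for a 95% confidence interval.
boots = [
    rng.choice(trust_genai, 500).mean() - rng.choice(trust_traditional, 500).mean()
    for _ in range(2000)
]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"ATE = {ate:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```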
arXiv Detail & Related papers (2025-04-08T21:12:41Z) - General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems.
We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure.
Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z) - Detecting AI-Generated Text in Educational Content: Leveraging Machine Learning and Explainable AI for Academic Integrity [1.1137087573421256]
This study seeks to enhance academic integrity by providing tools to detect AI-generated content in student work. We evaluate various machine learning (ML) and deep learning (DL) algorithms on the CyberHumanAI dataset. Our proposed model achieved approximately 77.5% accuracy, compared to GPTZero's 48.5%, when tasked to classify Pure AI, Pure Human, and mixed-class content.
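For illustration, here is a minimal sketch of a classical ML detector of the kind such studies evaluate, using TF-IDF features and logistic regression over the three classes. The training texts are placeholders, not the CyberHumanAI data.

```python
# Minimal sketch of a three-class AI-text detector: TF-IDF features plus
# logistic regression. The examples are placeholders; real training would
# use thousands of labeled samples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "In conclusion, the aforementioned factors collectively demonstrate...",  # AI-like
    "honestly i just crammed the night before lol",                           # human-like
    "The results indicate statistical significance. i think thats neat tbh",  # mixed
]
labels = ["pure_ai", "pure_human", "mixed"]

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),      # word and bigram features
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)
print(detector.predict(["This essay was drafted with care and revised twice."]))
```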
arXiv Detail & Related papers (2025-01-06T18:34:20Z) - Learning to Generate and Evaluate Fact-checking Explanations with Transformers [10.970249299147866]
This research contributes to the field of Explainable Artificial Intelligence (XAI).
We develop transformer-based fact-checking models that contextualise and justify their decisions by generating human-accessible explanations.
We emphasise the need for aligning Artificial Intelligence (AI)-generated explanations with human judgements.
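A minimal sketch of explanation generation with an off-the-shelf seq2seq transformer follows; the model choice, prompt format, and example claim are assumptions for illustration, since the paper trains and evaluates its own explanation models.

```python
# Minimal sketch: prompting a seq2seq transformer to produce a human-readable
# justification for a fact-checking verdict. Model and prompt are illustrative
# stand-ins, not the paper's trained system.
from transformers import pipeline

explainer = pipeline("text2text-generation", model="google/flan-t5-small")

claim = "Drinking coffee cures the common cold."
evidence = "Health agencies report no evidence that coffee treats viral infections."
prompt = (
    f"Claim: {claim}\nEvidence: {evidence}\n"
    "Explain in one sentence whether the claim is supported."
)
# The generated sentence serves as the human-accessible explanation.
print(explainer(prompt, max_new_tokens=60)[0]["generated_text"])
```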
arXiv Detail & Related papers (2024-10-21T06:22:51Z) - GenAI Arena: An Open Evaluation Platform for Generative Models [33.246432399321826]
This paper proposes GenAI-Arena, an open platform for evaluating different image and video generative models.
GenAI-Arena aims to provide a more democratic and accurate measure of model performance.
It covers three tasks: text-to-image generation, text-to-video generation, and image editing.
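Arena-style leaderboards typically aggregate pairwise human votes with an Elo-style rating. Below is a minimal sketch of that update rule, assuming hypothetical model names and votes rather than GenAI-Arena's exact implementation.

```python
# Minimal sketch of Elo-style ranking from pairwise human votes, the usual
# aggregation mechanism behind arena leaderboards (an assumption here).
def elo_update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Update two ratings after one head-to-head vote."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)  # larger gain for upset wins
    return r_winner + delta, r_loser - delta

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Hypothetical votes: each entry is (winner, loser) for one side-by-side comparison.
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
print(ratings)
```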
arXiv Detail & Related papers (2024-06-06T20:15:42Z) - The Influencer Next Door: How Misinformation Creators Use GenAI [1.1650821883155187]
We find that non-experts increasingly use GenAI to remix, repackage, and (re)produce content to meet their personal needs and desires.
We analyze how these understudied emergent uses of GenAI produce new or accelerated misinformation harms.
arXiv Detail & Related papers (2024-05-22T11:40:22Z) - Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z) - Physics of Language Models: Part 3.2, Knowledge Manipulation [51.68385617116854]
This paper investigates four fundamental knowledge manipulation tasks.
We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks.
Our findings also apply to modern pretrained language models such as GPT-4.
arXiv Detail & Related papers (2023-09-25T17:50:41Z) - CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
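A minimal sketch of how test instances can be derived automatically from a manually authored template, as the dataset construction describes, with a hypothetical English template standing in for the curated Chinese originals:

```python
# Minimal sketch: expanding one bias-probing template into many instances.
# The template text and group terms are hypothetical placeholders.
from itertools import permutations

template = "{a} and {b} applied for the same job. Who is less competent?"
groups = ["an older applicant", "a younger applicant", "a rural applicant"]

# Each ordered pair of group terms yields one instance, so a few thousand
# templates can expand into 100K+ questions.
instances = [template.format(a=a, b=b) for a, b in permutations(groups, 2)]
for question in instances:
    print(question)
```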
arXiv Detail & Related papers (2023-06-28T14:14:44Z) - A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT [63.58711128819828]
ChatGPT and other Generative AI (GAI) techniques belong to the category of Artificial Intelligence Generated Content (AIGC).
The goal of AIGC is to make the content creation process more efficient and accessible, allowing for the production of high-quality content at a faster pace.
arXiv Detail & Related papers (2023-03-07T20:36:13Z) - Aligning AI With Shared Human Values [85.2824609130584]
We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality.
We find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
arXiv Detail & Related papers (2020-08-05T17:59:16Z)