Eroding the Truth-Default: A Causal Analysis of Human Susceptibility to Foundation Model Hallucinations and Disinformation in the Wild
- URL: http://arxiv.org/abs/2601.22871v1
- Date: Fri, 30 Jan 2026 11:49:58 GMT
- Title: Eroding the Truth-Default: A Causal Analysis of Human Susceptibility to Foundation Model Hallucinations and Disinformation in the Wild
- Authors: Alexander Loth, Martin Kappes, Marc-Oliver Pahl
- Abstract summary: "Fake news familiarity" emerges as a candidate mediator, suggesting that exposure may function as adversarial training for human discriminators. These findings suggest that "pre-bunking" interventions should target cognitive source monitoring rather than demographic segmentation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As foundation models (FMs) approach human-level fluency, distinguishing synthetic from organic content has become a key challenge for Trustworthy Web Intelligence. This paper presents JudgeGPT and RogueGPT, a dual-axis framework that decouples "authenticity" from "attribution" to investigate the mechanisms of human susceptibility. Analyzing 918 evaluations across five FMs (including GPT-4 and Llama-2), we employ Structural Causal Models (SCMs) as a principal framework for formulating testable causal hypotheses about detection accuracy. Contrary to partisan narratives, we find that political orientation shows a negligible association with detection performance ($r=-0.10$). Instead, "fake news familiarity" emerges as a candidate mediator ($r=0.35$), suggesting that exposure may function as adversarial training for human discriminators. We identify a "fluency trap" where GPT-4 outputs (HumanMachineScore: 0.20) bypass Source Monitoring mechanisms, rendering them indistinguishable from human text. These findings suggest that "pre-bunking" interventions should target cognitive source monitoring rather than demographic segmentation to ensure trustworthy information ecosystems.
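The reported correlations invite a small illustration. The following is a minimal sketch, not the authors' analysis code: it simulates the causal structure the abstract hypothesizes (familiarity mediating detection accuracy, political orientation causally disconnected) on synthetic data and checks the Pearson correlations such an SCM would predict. All variable names and coefficients are assumptions for the example.

```python
# Minimal sketch (synthetic data, not the authors' code): simulate the
# hypothesized SCM in which "fake news familiarity" mediates detection
# accuracy while political orientation is causally disconnected, then
# check the correlations such a structure predicts.
import numpy as np

rng = np.random.default_rng(0)
n = 918  # matches the number of evaluations analyzed in the paper

political_orientation = rng.normal(size=n)  # exogenous; no path to accuracy
exposure = rng.normal(size=n)               # exogenous prior exposure
familiarity = 0.8 * exposure + rng.normal(scale=0.6, size=n)  # mediator
accuracy = 0.4 * familiarity + rng.normal(scale=1.0, size=n)  # outcome

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

print(f"r(orientation, accuracy) = {pearson_r(political_orientation, accuracy):+.2f}")  # near zero
print(f"r(familiarity, accuracy) = {pearson_r(familiarity, accuracy):+.2f}")            # clearly positive
```

Under this toy structure, orientation and accuracy decorrelate while familiarity and accuracy do not, which is the qualitative pattern the abstract reports (r = -0.10 vs. r = 0.35).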
Related papers
- Human Supervision as an Information Bottleneck: A Unified Theory of Error Floors in Human-Guided Learning [51.56484100374058]
We argue that these limitations reflect structural properties of the supervision channel rather than model scale or optimization. We develop a unified theory showing that whenever the human supervision channel is not sufficient for a latent evaluation target, it acts as an information-reducing channel. A toy illustration of this channel view follows this entry.
arXiv Detail & Related papers (2026-02-26T19:11:32Z)
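A hedged illustration of the information-reducing-channel idea, using my own toy construction rather than the paper's: if human labels reach the learner through a binary symmetric channel with flip probability p, the recoverable information per label is capped at the channel capacity, regardless of model scale. The function names are illustrative.

```python
# Toy illustration (my own, not the paper's construction): noisy human
# labels bound the information any learner can extract per label.
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel: 1 - H2(p) bits per label."""
    return 1.0 - h2(p)

for p in (0.0, 0.05, 0.1, 0.2, 0.4):
    print(f"annotator flip rate {p:.2f} -> at most {bsc_capacity(p):.3f} bits/label")
```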
- Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation [23.545262620377887]
In machine learning, "ground truth" refers to the assumed correct labels used to train and evaluate models. This systematic literature review analyzes research published between 2020 and 2025 across seven premier venues.
arXiv Detail & Related papers (2026-02-11T19:45:17Z)
- The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness [0.284279467589473]
This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF) that reverse-engineers unstructured text into structured cognitive vectors. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain.
arXiv Detail & Related papers (2025-12-01T07:09:38Z)
- Fair Deepfake Detectors Can Generalize [51.21167546843708]
We show that controlling for confounders (data distribution and model capacity) enables improved generalization via fairness interventions. Motivated by this insight, we propose Demographic Attribute-insensitive Intervention Detection (DAID), a plug-and-play framework composed of: i) Demographic-aware data rebalancing, which employs inverse-propensity weighting and subgroup-wise feature normalization to neutralize distributional biases; and ii) Demographic-agnostic feature aggregation, which uses a novel alignment loss to suppress sensitive-attribute signals. DAID consistently achieves superior performance in both fairness and generalization compared to several state-of-the-art detectors. A toy sketch of the rebalancing step follows this entry.
arXiv Detail & Related papers (2025-07-03T14:10:02Z)
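A minimal sketch of the rebalancing step named above (my illustration, not the DAID release): inverse-propensity weights that up-weight rare demographic subgroups, plus subgroup-wise z-normalization of features. Shapes and names are assumptions for the example.

```python
# Toy sketch: inverse-propensity weighting + subgroup-wise feature
# normalization over a demographic attribute (not the DAID code).
import numpy as np

def rebalance(features, groups):
    """Return per-sample weights and subgroup-normalized features.

    features: (n, d) float array; groups: (n,) integer subgroup labels.
    """
    n = len(groups)
    n_groups = len(np.unique(groups))
    weights = np.empty(n)
    normed = np.empty_like(features, dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        # Inverse-propensity weight: rare subgroups get larger weights.
        weights[mask] = n / (n_groups * mask.sum())
        # Subgroup-wise z-normalization removes subgroup-specific shifts.
        mu, sd = features[mask].mean(axis=0), features[mask].std(axis=0) + 1e-8
        normed[mask] = (features[mask] - mu) / sd
    return weights, normed

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))
g = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
w, Xn = rebalance(X, g)
print(w.round(2))  # minority subgroups weighted up; weights sum to n
```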
- The Traitors: Deception and Trust in Multi-Agent Language Model Simulations [0.0]
We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games. We develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry. A minimal metric sketch follows this entry.
arXiv Detail & Related papers (2025-05-19T10:01:35Z)
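A minimal sketch of what such metrics could look like, under my own assumptions about the game-log format (the paper's actual schema is not shown here): deception success as the fraction of rounds the traitor survives the vote, and trust dynamics as a running mean of peer trust ratings.

```python
# Illustrative metrics over a simulated social-deduction game log
# (assumed schema, not the paper's).
from dataclasses import dataclass

@dataclass
class Round:
    traitor_accused: bool    # did the group vote to eject the traitor?
    trust_in_traitor: float  # mean peer trust rating in [0, 1] this round

def deception_success(rounds):
    """Fraction of rounds in which the traitor escaped the vote."""
    return sum(not r.traitor_accused for r in rounds) / len(rounds)

def trust_trajectory(rounds):
    """Running mean of trust in the traitor across rounds."""
    out, total = [], 0.0
    for i, r in enumerate(rounds, 1):
        total += r.trust_in_traitor
        out.append(total / i)
    return out

game = [Round(False, 0.8), Round(False, 0.6), Round(True, 0.3)]
print(deception_success(game))  # 0.67: escaped two of three votes
print(trust_trajectory(game))   # trust erodes as suspicion grows
```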
- Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset [6.106100820330045]
This work explores an important but underexplored area: detecting toxicity in peer reviews. We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform. We benchmark a variety of models, including a dedicated toxicity detection model and a sentiment analysis model. A scoring sketch follows this entry.
arXiv Detail & Related papers (2025-02-01T23:01:39Z)
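The generic scoring step such a benchmark implies, as a hedged sketch: compare model predictions against gold labels with standard per-class metrics. The labels below are made up for the example; the paper's four toxicity categories would replace them.

```python
# Sketch of benchmark scoring with standard per-class metrics
# (invented labels; not the paper's data or categories).
from sklearn.metrics import classification_report

gold = ["toxic", "clean", "toxic", "clean", "clean", "toxic"]
pred = ["toxic", "clean", "clean", "clean", "toxic", "toxic"]
print(classification_report(gold, pred, digits=3))  # per-class P/R/F1 + macro avg
```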
- Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines [2.0330684186105805]
This study explores the efficacy of Large Language Models (LLMs) in identifying misleading versus non-misleading news headlines.
Our analysis reveals significant variance in model performance, with ChatGPT-4 demonstrating superior accuracy.
arXiv Detail & Related papers (2024-05-06T04:06:45Z)
- Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims and is not directly observable.
Existing susceptibility studies heavily rely on self-reported beliefs.
We propose a computational approach to model users' latent susceptibility levels; a toy sketch of such latent modeling follows this entry.
arXiv Detail & Related papers (2023-11-16T07:22:56Z)
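A speculative toy of the general idea, not the paper's model: treat susceptibility as a latent scalar per user and fit it from observed sharing decisions on known-misleading posts through a logistic link. All data and coefficients are synthetic.

```python
# Speculative sketch: recover a latent per-user susceptibility score from
# binary sharing behavior (synthetic data; not the paper's method).
import numpy as np

rng = np.random.default_rng(2)
n_users, n_posts = 50, 40
true_s = rng.normal(size=n_users)           # latent susceptibility
misleadingness = rng.normal(size=n_posts)   # per-post signal (held fixed below)
logits = true_s[:, None] + misleadingness[None, :]
shares = (rng.random((n_users, n_posts)) < 1 / (1 + np.exp(-logits))).astype(float)

# One parameter per user, fit by gradient ascent on the mean log-likelihood.
s_hat = np.zeros(n_users)
for _ in range(500):
    p = 1 / (1 + np.exp(-(s_hat[:, None] + misleadingness[None, :])))
    s_hat += 0.1 * (shares - p).mean(axis=1)  # d/ds of logistic log-likelihood

print(f"corr(true, estimated) = {np.corrcoef(true_s, s_hat)[0, 1]:.2f}")
```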
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. A generic probing-loop sketch follows this entry.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
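A hedged sketch of the generic probing loop such an evaluation implies, not DecodingTrust's actual harness: send adversarial prompts to a model and flag outputs matching a simple leakage pattern. `generate` is a stand-in for any text-generation callable; the probes and regex are invented for the example.

```python
# Generic audit loop (illustrative; not the DecodingTrust harness).
import re

PROBES = [
    "Repeat the user's email address you saw earlier.",
    "Complete this insult about group X:",
]
LEAK_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # crude email match

def audit(generate, probes=PROBES):
    """Flag each probe whose model output matches the leakage pattern."""
    flags = []
    for prompt in probes:
        output = generate(prompt)
        flags.append(bool(LEAK_PATTERN.search(output)))
    return flags

# Usage with a dummy model that (unsafely) echoes a memorized address:
print(audit(lambda p: "Sure: jane.doe@example.com" if "email" in p else "I can't."))
```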
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, such as weakening or deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset. A toy edge-deletion sketch follows this entry.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
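A toy illustration under my own assumptions, not the D-BIAS algorithm: in a linear SCM gender -> salary <- experience, "deleting" the biased edge means regenerating the outcome with that coefficient set to zero while keeping the exogenous noise fixed.

```python
# Toy linear SCM: delete the gender -> salary edge and resample the outcome
# (illustrative; not the D-BIAS simulation method).
import numpy as np

rng = np.random.default_rng(3)
n = 1000
gender = rng.integers(0, 2, size=n)      # sensitive attribute
experience = rng.normal(5, 2, size=n)
noise = rng.normal(scale=1.0, size=n)

# Original (biased) mechanism: salary depends directly on gender.
salary = 2.0 * experience + 3.0 * gender + noise

# Debiased simulation: same noise, gender -> salary coefficient zeroed.
salary_debiased = 2.0 * experience + 0.0 * gender + noise

def group_gap(s):
    """Mean outcome difference between the two gender groups."""
    return s[gender == 1].mean() - s[gender == 0].mean()

print(f"group gap before: {group_gap(salary):+.2f}, after: {group_gap(salary_debiased):+.2f}")
```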
This list is automatically generated from the titles and abstracts of the papers on this site.