Eroding the Truth-Default: A Causal Analysis of Human Susceptibility to Foundation Model Hallucinations and Disinformation in the Wild
- URL: http://arxiv.org/abs/2601.22871v1
- Date: Fri, 30 Jan 2026 11:49:58 GMT
- Title: Eroding the Truth-Default: A Causal Analysis of Human Susceptibility to Foundation Model Hallucinations and Disinformation in the Wild
- Authors: Alexander Loth, Martin Kappes, Marc-Oliver Pahl
- Abstract summary: "Fake news familiarity" emerges as a candidate mediator, suggesting that exposure may function as adversarial training for human discriminators. These findings suggest that "pre-bunking" interventions should target cognitive source monitoring rather than demographic segmentation.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As foundation models (FMs) approach human-level fluency, distinguishing synthetic from organic content has become a key challenge for Trustworthy Web Intelligence. This paper presents JudgeGPT and RogueGPT, a dual-axis framework that decouples "authenticity" from "attribution" to investigate the mechanisms of human susceptibility. Analyzing 918 evaluations across five FMs (including GPT-4 and Llama-2), we employ Structural Causal Models (SCMs) as a principal framework for formulating testable causal hypotheses about detection accuracy. Contrary to partisan narratives, we find that political orientation shows a negligible association with detection performance ($r=-0.10$). Instead, "fake news familiarity" emerges as a candidate mediator ($r=0.35$), suggesting that exposure may function as adversarial training for human discriminators. We identify a "fluency trap" where GPT-4 outputs (HumanMachineScore: 0.20) bypass Source Monitoring mechanisms, rendering them indistinguishable from human text. These findings suggest that "pre-bunking" interventions should target cognitive source monitoring rather than demographic segmentation to ensure trustworthy information ecosystems.
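The reported correlations invite a small illustration. The following is a minimal sketch, not the authors' analysis code: it simulates the causal structure the abstract hypothesizes (familiarity mediating detection accuracy, political orientation causally disconnected) on synthetic data and checks the Pearson correlations such an SCM would predict. All variable names and coefficients are assumptions for the example.

```python
# Minimal sketch (synthetic data, not the authors' code): simulate the
# hypothesized SCM in which "fake news familiarity" mediates detection
# accuracy while political orientation is causally disconnected, then
# check the correlations such a structure predicts.
import numpy as np

rng = np.random.default_rng(0)
n = 918  # matches the number of evaluations analyzed in the paper

political_orientation = rng.normal(size=n)  # exogenous; no path to accuracy
exposure = rng.normal(size=n)               # exogenous prior exposure
familiarity = 0.8 * exposure + rng.normal(scale=0.6, size=n)  # mediator
accuracy = 0.4 * familiarity + rng.normal(scale=1.0, size=n)  # outcome

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

print(f"r(orientation, accuracy) = {pearson_r(political_orientation, accuracy):+.2f}")  # near zero
print(f"r(familiarity, accuracy) = {pearson_r(familiarity, accuracy):+.2f}")            # clearly positive
```

Under this toy structure, orientation and accuracy decorrelate while familiarity and accuracy do not, which is the qualitative pattern the abstract reports (r = -0.10 vs. r = 0.35).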
Related papers
- Human Supervision as an Information Bottleneck: A Unified Theory of Error Floors in Human-Guided Learning [51.56484100374058]
We argue that these limitations reflect structural properties of the supervision channel rather than model scale or optimization. We develop a unified theory showing that whenever the human supervision channel is not sufficient for a latent evaluation target, it acts as an information-reducing channel. A toy illustration of this channel view follows this entry.
arXiv Detail & Related papers (2026-02-26T19:11:32Z)
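A hedged illustration of the information-reducing-channel idea, using my own toy construction rather than the paper's: if human labels reach the learner through a binary symmetric channel with flip probability p, the recoverable information per label is capped at the channel capacity, regardless of model scale. The function names are illustrative.

```python
# Toy illustration (my own, not the paper's construction): noisy human
# labels bound the information any learner can extract per label.
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel: 1 - H2(p) bits per label."""
    return 1.0 - h2(p)

for p in (0.0, 0.05, 0.1, 0.2, 0.4):
    print(f"annotator flip rate {p:.2f} -> at most {bsc_capacity(p):.3f} bits/label")
```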
- Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation [23.545262620377887]
In machine learning, "ground truth" refers to the assumed correct labels used to train and evaluate models. This systematic literature review analyzes research published between 2020 and 2025 across seven premier venues.
arXiv Detail & Related papers (2026-02-11T19:45:17Z)
- The Necessity of Imperfection: Reversing Model Collapse via Simulating Cognitive Boundedness [0.284279467589473]
This paper proposes a paradigm shift: instead of imitating the surface properties of data, we simulate the cognitive processes that generate human text. We introduce the Prompt-driven Cognitive Computing Framework (PMCSF) that reverse-engineers unstructured text into structured cognitive vectors. Our findings demonstrate that modelling human cognitive limitations -- not copying surface data -- enables synthetic data with genuine functional gain.
arXiv Detail & Related papers (2025-12-01T07:09:38Z)
- Fair Deepfake Detectors Can Generalize [51.21167546843708]
We show that controlling for confounders (data distribution and model capacity) enables improved generalization via fairness interventions. Motivated by this insight, we propose Demographic Attribute-insensitive Intervention Detection (DAID), a plug-and-play framework composed of: i) Demographic-aware data rebalancing, which employs inverse-propensity weighting and subgroup-wise feature normalization to neutralize distributional biases; and ii) Demographic-agnostic feature aggregation, which uses a novel alignment loss to suppress sensitive-attribute signals. DAID consistently achieves superior performance in both fairness and generalization compared to several state-of-the-art detectors. A toy sketch of the rebalancing step follows this entry.
arXiv Detail & Related papers (2025-07-03T14:10:02Z)
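A minimal sketch of the rebalancing step named above (my illustration, not the DAID release): inverse-propensity weights that up-weight rare demographic subgroups, plus subgroup-wise z-normalization of features. Shapes and names are assumptions for the example.

```python
# Toy sketch: inverse-propensity weighting + subgroup-wise feature
# normalization over a demographic attribute (not the DAID code).
import numpy as np

def rebalance(features, groups):
    """Return per-sample weights and subgroup-normalized features.

    features: (n, d) float array; groups: (n,) integer subgroup labels.
    """
    n = len(groups)
    n_groups = len(np.unique(groups))
    weights = np.empty(n)
    normed = np.empty_like(features, dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        # Inverse-propensity weight: rare subgroups get larger weights.
        weights[mask] = n / (n_groups * mask.sum())
        # Subgroup-wise z-normalization removes subgroup-specific shifts.
        mu, sd = features[mask].mean(axis=0), features[mask].std(axis=0) + 1e-8
        normed[mask] = (features[mask] - mu) / sd
    return weights, normed

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))
g = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
w, Xn = rebalance(X, g)
print(w.round(2))  # minority subgroups weighted up; weights sum to n
```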
- The Traitors: Deception and Trust in Multi-Agent Language Model Simulations [0.0]
We introduce The Traitors, a multi-agent simulation framework inspired by social deduction games. We develop a suite of evaluation metrics capturing deception success, trust dynamics, and collective inference quality. Our initial experiments across DeepSeek-V3, GPT-4o-mini, and GPT-4o (10 runs per model) reveal a notable asymmetry. A minimal metric sketch follows this entry.
arXiv Detail & Related papers (2025-05-19T10:01:35Z)
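A minimal sketch of what such metrics could look like, under my own assumptions about the game-log format (the paper's actual schema is not shown here): deception success as the fraction of rounds the traitor survives the vote, and trust dynamics as a running mean of peer trust ratings.

```python
# Illustrative metrics over a simulated social-deduction game log
# (assumed schema, not the paper's).
from dataclasses import dataclass

@dataclass
class Round:
    traitor_accused: bool    # did the group vote to eject the traitor?
    trust_in_traitor: float  # mean peer trust rating in [0, 1] this round

def deception_success(rounds):
    """Fraction of rounds in which the traitor escaped the vote."""
    return sum(not r.traitor_accused for r in rounds) / len(rounds)

def trust_trajectory(rounds):
    """Running mean of trust in the traitor across rounds."""
    out, total = [], 0.0
    for i, r in enumerate(rounds, 1):
        total += r.trust_in_traitor
        out.append(total / i)
    return out

game = [Round(False, 0.8), Round(False, 0.6), Round(True, 0.3)]
print(deception_success(game))  # 0.67: escaped two of three votes
print(trust_trajectory(game))   # trust erodes as suspicion grows
```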
- Benchmark on Peer Review Toxic Detection: A Challenging Task with a New Dataset [6.106100820330045]
This work explores an important but underexplored area: detecting toxicity in peer reviews. We first define toxicity in peer reviews across four distinct categories and curate a dataset of peer reviews from the OpenReview platform. We benchmark a variety of models, including a dedicated toxicity detection model and a sentiment analysis model. A scoring sketch follows this entry.
arXiv Detail & Related papers (2025-02-01T23:01:39Z)
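The generic scoring step such a benchmark implies, as a hedged sketch: compare model predictions against gold labels with standard per-class metrics. The labels below are made up for the example; the paper's four toxicity categories would replace them.

```python
# Sketch of benchmark scoring with standard per-class metrics
# (invented labels; not the paper's data or categories).
from sklearn.metrics import classification_report

gold = ["toxic", "clean", "toxic", "clean", "clean", "toxic"]
pred = ["toxic", "clean", "clean", "clean", "toxic", "toxic"]
print(classification_report(gold, pred, digits=3))  # per-class P/R/F1 + macro avg
```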
- Exploring the Potential of the Large Language Models (LLMs) in Identifying Misleading News Headlines [2.0330684186105805]
This study explores the efficacy of Large Language Models (LLMs) in identifying misleading versus non-misleading news headlines.
Our analysis reveals significant variance in model performance, with ChatGPT-4 demonstrating superior accuracy.
arXiv Detail & Related papers (2024-05-06T04:06:45Z)
- Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims and is not directly observable.
Existing susceptibility studies heavily rely on self-reported beliefs.
We propose a computational approach to model users' latent susceptibility levels; a toy sketch of such latent modeling follows this entry.
arXiv Detail & Related papers (2023-11-16T07:22:56Z)
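A speculative toy of the general idea, not the paper's model: treat susceptibility as a latent scalar per user and fit it from observed sharing decisions on known-misleading posts through a logistic link. All data and coefficients are synthetic.

```python
# Speculative sketch: recover a latent per-user susceptibility score from
# binary sharing behavior (synthetic data; not the paper's method).
import numpy as np

rng = np.random.default_rng(2)
n_users, n_posts = 50, 40
true_s = rng.normal(size=n_users)           # latent susceptibility
misleadingness = rng.normal(size=n_posts)   # per-post signal (held fixed below)
logits = true_s[:, None] + misleadingness[None, :]
shares = (rng.random((n_users, n_posts)) < 1 / (1 + np.exp(-logits))).astype(float)

# One parameter per user, fit by gradient ascent on the mean log-likelihood.
s_hat = np.zeros(n_users)
for _ in range(500):
    p = 1 / (1 + np.exp(-(s_hat[:, None] + misleadingness[None, :])))
    s_hat += 0.1 * (shares - p).mean(axis=1)  # d/ds of logistic log-likelihood

print(f"corr(true, estimated) = {np.corrcoef(true_s, s_hat)[0, 1]:.2f}")
```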
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. A generic probing-loop sketch follows this entry.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
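A hedged sketch of the generic probing loop such an evaluation implies, not DecodingTrust's actual harness: send adversarial prompts to a model and flag outputs matching a simple leakage pattern. `generate` is a stand-in for any text-generation callable; the probes and regex are invented for the example.

```python
# Generic audit loop (illustrative; not the DecodingTrust harness).
import re

PROBES = [
    "Repeat the user's email address you saw earlier.",
    "Complete this insult about group X:",
]
LEAK_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # crude email match

def audit(generate, probes=PROBES):
    """Flag each probe whose model output matches the leakage pattern."""
    flags = []
    for prompt in probes:
        output = generate(prompt)
        flags.append(bool(LEAK_PATTERN.search(output)))
    return flags

# Usage with a dummy model that (unsafely) echoes a memorized address:
print(audit(lambda p: "Sure: jane.doe@example.com" if "email" in p else "I can't."))
```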
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, such as weakening or deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset. A toy edge-deletion sketch follows this entry.
arXiv Detail & Related papers (2022-08-10T03:41:48Z)
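A toy illustration under my own assumptions, not the D-BIAS algorithm: in a linear SCM gender -> salary <- experience, "deleting" the biased edge means regenerating the outcome with that coefficient set to zero while keeping the exogenous noise fixed.

```python
# Toy linear SCM: delete the gender -> salary edge and resample the outcome
# (illustrative; not the D-BIAS simulation method).
import numpy as np

rng = np.random.default_rng(3)
n = 1000
gender = rng.integers(0, 2, size=n)      # sensitive attribute
experience = rng.normal(5, 2, size=n)
noise = rng.normal(scale=1.0, size=n)

# Original (biased) mechanism: salary depends directly on gender.
salary = 2.0 * experience + 3.0 * gender + noise

# Debiased simulation: same noise, gender -> salary coefficient zeroed.
salary_debiased = 2.0 * experience + 0.0 * gender + noise

def group_gap(s):
    """Mean outcome difference between the two gender groups."""
    return s[gender == 1].mean() - s[gender == 0].mean()

print(f"group gap before: {group_gap(salary):+.2f}, after: {group_gap(salary_debiased):+.2f}")
```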
This list is automatically generated from the titles and abstracts of the papers on this site.