LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems
- URL: http://arxiv.org/abs/2601.16890v1
- Date: Fri, 23 Jan 2026 16:57:16 GMT
- Title: LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems
- Authors: João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
- Abstract summary: We introduce a novel class of persuasive adversarial attacks on automated fact-checking systems. We study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.
- Score: 9.795192821776462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated fact-checking (AFC) systems are susceptible to adversarial attacks, enabling false claims to evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, yet no existing framework exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a novel class of persuasive adversarial attacks on AFCs by employing a generative LLM to rephrase claims using persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.
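To make the attack setting concrete, below is a minimal sketch of the persuasion-based rephrasing loop the abstract describes. The prompt wording, the technique snippets, the `gpt-4o-mini` model choice, and the OpenAI client are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a persuasion-based adversarial rephrasing attack.
# Prompt, technique list, and model name are illustrative assumptions,
# not the paper's actual setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSUASION_TECHNIQUES = {
    "appeal_to_authority": "Rephrase the claim so it cites an authoritative-sounding source.",
    "loaded_language": "Rephrase the claim using emotionally charged wording.",
    "repetition": "Rephrase the claim so its key assertion is restated for emphasis.",
}

def rephrase_claim(claim: str, technique: str) -> str:
    """Ask an LLM to rewrite a claim with a given persuasion technique,
    preserving its factual content so the original label still applies."""
    prompt = (
        f"{PERSUASION_TECHNIQUES[technique]}\n"
        "Keep the factual content of the claim unchanged.\n"
        f"Claim: {claim}\nRewritten claim:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Decoupled evaluation: score the rewritten claim against the verifier
# and the evidence retriever separately, comparing with the original.
adversarial = rephrase_claim(
    "The Eiffel Tower was completed in 1889.", "appeal_to_authority"
)
```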
Related papers
- DECEIVE-AFC: Adversarial Claim Attacks against Search-Enabled LLM-based Fact-Checking Systems [38.6944646666426]
We study adversarial claim attacks against search-enabled fact-checking systems under a realistic input-only threat model. We propose DECEIVE-AFC, an agent-based adversarial attack framework that integrates novel claim-level attack strategies and adversarial claim validity evaluation principles. Our attacks substantially degrade verification performance, reducing accuracy from 78.7% to 53.7%, and significantly outperform existing claim-based attack baselines with strong cross-system transferability.
arXiv Detail & Related papers (2026-01-31T03:49:23Z) - Adversarial Attacks Against Automated Fact-Checking: A Survey [36.08022268176274]
This survey provides the first in-depth review of adversarial attacks targeting fact-checking systems. We examine recent advancements in adversary-aware defenses and highlight open research questions. Our findings underscore the urgent need for resilient FC frameworks capable of withstanding adversarial manipulations.
arXiv Detail & Related papers (2025-09-10T10:10:10Z) - Concealment of Intent: A Game-Theoretic Analysis [15.387256204743407]
We present a scalable attack strategy: intent-hiding adversarial prompting, which conceals malicious intent through the composition of skills. Our analysis identifies equilibrium points and reveals structural advantages for the attacker. Empirically, we validate the attack's effectiveness on multiple real-world LLMs across a range of malicious behaviors.
arXiv Detail & Related papers (2025-05-27T07:59:56Z) - Topic-FlipRAG: Topic-Orientated Adversarial Opinion Manipulation Attacks to Retrieval-Augmented Generation Models [22.296368955665475]
We propose a two-stage manipulation attack pipeline that crafts adversarial perturbations to influence opinions across related queries. Experiments show that the proposed attacks effectively shift the opinion of the model's outputs on specific topics.
arXiv Detail & Related papers (2025-02-03T14:21:42Z) - Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges [52.96987928118327]
We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to content injection attacks. We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance. Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
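A minimal sketch of the second threat above, injecting query terms into an unrelated passage to inflate its similarity under an embedding retriever; the `sentence-transformers` model choice and the injection position are assumptions for illustration.

```python
# Sketch of a query-term injection attack on an embedding retriever.
# Model choice and injection position are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def inject_query(passage: str, query: str) -> str:
    """Prepend the query verbatim to an unrelated passage to inflate
    its embedding similarity with that query."""
    return f"{query}. {passage}"

query = "What year did the Apollo 11 mission land on the moon?"
benign = "Instructions for assembling the bookshelf are in the manual."
attacked = inject_query(benign, query)

q, b, a = model.encode([query, benign, attacked])
print("benign similarity:  ", util.cos_sim(q, b).item())
print("attacked similarity:", util.cos_sim(q, a).item())  # noticeably higher
```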
arXiv Detail & Related papers (2025-01-30T18:02:15Z) - Defense of Adversarial Ranking Attack in Text Retrieval: Benchmark and Baseline via Detection [12.244543468021938]
This paper introduces two types of detection tasks for adversarial documents.
A benchmark dataset is established to facilitate the investigation of adversarial ranking defense.
A comprehensive investigation of the performance of several detection baselines is conducted.
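As one concrete illustration of what such a detection baseline can look like, here is a generic perplexity-filtering detector: adversarially perturbed documents often score higher perplexity under a small language model. This is a common baseline in the literature, not necessarily one of the paper's specific baselines, and the threshold is an assumed placeholder.

```python
# Generic perplexity-filtering baseline for flagging adversarial
# documents; a stand-in, not the paper's specific detectors.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Score a document with GPT-2; adversarially perturbed text tends
    to have higher perplexity than natural text."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

THRESHOLD = 80.0  # assumed placeholder, tuned on a held-out benign corpus

def is_suspicious(doc: str) -> bool:
    return perplexity(doc) > THRESHOLD
```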
arXiv Detail & Related papers (2023-07-31T16:31:24Z) - Fact-Saboteurs: A Taxonomy of Evidence Manipulation Attacks against Fact-Verification Systems [80.3811072650087]
We show that it is possible to subtly modify claim-salient snippets in the evidence and generate diverse and claim-aligned evidence.
The attacks are also robust against post-hoc modifications of the claim.
These attacks can have harmful implications for inspectable and human-in-the-loop usage scenarios.
arXiv Detail & Related papers (2022-09-07T13:39:24Z) - Illusory Attacks: Information-Theoretic Detectability Matters in Adversarial Attacks [76.35478518372692]
We introduce epsilon-illusory, a novel form of adversarial attack on sequential decision-makers.
Compared to existing attacks, we empirically find epsilon-illusory to be significantly harder to detect with automated methods.
Our findings suggest the need for better anomaly detectors, as well as effective hardware- and system-level defenses.
arXiv Detail & Related papers (2022-07-20T19:49:09Z) - Adversarial Robustness of Deep Reinforcement Learning based Dynamic Recommender Systems [50.758281304737444]
We propose to explore adversarial examples and attack detection on reinforcement learning-based interactive recommendation systems.
We first craft different types of adversarial examples by adding perturbations to the input and intervening on the causal factors.
Then, we augment recommendation systems by detecting potential attacks with a deep learning-based classifier based on the crafted data.
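A minimal sketch of that craft-then-detect loop; the synthetic features, directional perturbation, and logistic-regression probe are stand-ins for the paper's reinforcement-learning states, crafted attacks, and deep learning-based classifier.

```python
# Sketch of the craft-then-detect loop: perturb benign interaction
# states to build an attack set, then fit a detector. A linear probe
# stands in for the paper's deep learning-based classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 16
target_direction = rng.normal(size=dim)
target_direction /= np.linalg.norm(target_direction)

def perturb(x: np.ndarray, eps: float = 0.5) -> np.ndarray:
    """Crafted adversarial example: shift the state toward a direction
    that promotes a target item (stand-in for gradient-based crafting)."""
    return x + eps * target_direction

benign = rng.normal(size=(500, dim))            # benign interaction states
attacked = np.apply_along_axis(perturb, 1, benign)

X = np.vstack([benign, attacked])
y = np.array([0] * 500 + [1] * 500)             # 1 = adversarial

detector = LogisticRegression(max_iter=1000).fit(X, y)
print("detector accuracy:", detector.score(X, y))
```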
arXiv Detail & Related papers (2021-12-02T04:12:24Z) - Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning [95.60856995067083]
This work is among the first to perform adversarial defense for ASV without knowing the specific attack algorithms.
We propose to perform adversarial defense from two perspectives: 1) adversarial perturbation purification and 2) adversarial perturbation detection.
Experimental results show that our detection module effectively shields the ASV by detecting adversarial samples with an accuracy of around 80%.
arXiv Detail & Related papers (2021-06-01T07:10:54Z) - Adversarial Attacks and Detection on Reinforcement Learning-Based Interactive Recommender Systems [47.70973322193384]
Adversarial attacks are difficult to detect at an early stage.
We propose attack-agnostic detection on reinforcement learning-based interactive recommendation systems.
We first craft adversarial examples to show their diverse distributions and then augment recommendation systems by detecting potential attacks.
arXiv Detail & Related papers (2020-06-14T15:41:47Z)