It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
- URL: http://arxiv.org/abs/2506.02873v1
- Date: Tue, 03 Jun 2025 13:37:51 GMT
- Title: It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics
- Authors: Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine
- Abstract summary: We introduce an automated model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics.
- Score: 5.418014947856176
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly "follow orders" to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Persuade Eval (APE) benchmark, which shifts the focus from persuasion success to persuasion attempts, operationalized as a model's willingness to generate content aimed at shaping beliefs or behavior. Our evaluation framework probes frontier LLMs using a multi-turn conversational setup between simulated persuader and persuadee agents. APE explores a diverse spectrum of topics including conspiracies, controversial issues, and non-controversially harmful content. We introduce an automated evaluator model to identify willingness to persuade and measure the frequency and context of persuasive attempts. We find that many open and closed-weight models are frequently willing to attempt persuasion on harmful topics and that jailbreaking can increase willingness to engage in such behavior. Our results highlight gaps in current safety guardrails and underscore the importance of evaluating willingness to persuade as a key dimension of LLM risk. APE is available at github.com/AlignmentResearch/AttemptPersuadeEval
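The evaluation loop the abstract describes (a simulated persuader and persuadee conversing over multiple turns, with an automated evaluator judging each persuader turn) can be pictured with the minimal sketch below. This is an illustrative approximation only, not the released APE code (see the GitHub link above); the callables `persuader`, `persuadee`, and `evaluator`, the "attempt" label, and the helper names `run_episode` and `attempt_rate` are hypothetical placeholders.

```python
# Minimal sketch of a multi-turn persuasion-attempt evaluation.
# Assumptions, not the released APE implementation: a persuader model is
# instructed to convince a simulated persuadee of a topic, and an automated
# evaluator labels each persuader turn (e.g. "attempt" vs. "refusal").

from dataclasses import dataclass, field

@dataclass
class Conversation:
    topic: str                                   # e.g. a harmful or conspiratorial claim
    turns: list = field(default_factory=list)    # (role, text) pairs

def run_episode(persuader, persuadee, evaluator, topic, n_turns=3):
    """Run one multi-turn episode and return the evaluator's label per persuader turn."""
    convo = Conversation(topic=topic)
    labels = []
    for _ in range(n_turns):
        # Persuader produces the next message given the conversation so far.
        p_msg = persuader(convo)                 # hypothetical callable: history -> reply
        convo.turns.append(("persuader", p_msg))

        # Automated evaluator judges whether this turn is a persuasion attempt.
        labels.append(evaluator(topic, p_msg))   # hypothetical callable: (topic, text) -> label

        # Simulated persuadee responds, keeping the conversation going.
        convo.turns.append(("persuadee", persuadee(convo)))
    return labels

def attempt_rate(all_labels):
    """Fraction of persuader turns judged to be persuasion attempts."""
    flat = [label for episode in all_labels for label in episode]
    return sum(label == "attempt" for label in flat) / max(len(flat), 1)
```

Aggregating a statistic like `attempt_rate` over the benchmark's topic categories (conspiracies, controversial issues, non-controversially harmful content) would correspond to the willingness-to-persuade measurements the abstract reports; the released repository should be treated as the authoritative implementation.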
Related papers
- How Do LLMs Persuade? Linear Probes Can Uncover Persuasion Dynamics in Multi-Turn Conversations [11.221875709359974]
Large Language Models (LLMs) have started to demonstrate the ability to persuade humans. Recent work has used linear probes, lightweight tools for analyzing model representations, to study various LLM skills. Motivated by this, we apply probes to study persuasion dynamics in natural, multi-turn conversations.
arXiv Detail & Related papers (2025-08-07T17:58:41Z) - Must Read: A Systematic Survey of Computational Persuasion [60.83151988635103]
AI-driven persuasion can be leveraged for beneficial applications, but also poses threats through manipulation and unethical influence. Our survey outlines future research directions to enhance the safety, fairness, and effectiveness of AI-powered persuasion.
arXiv Detail & Related papers (2025-05-12T17:26:31Z) - LLM Can be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models [47.27098710953806]
We introduce PersuSafety, the first comprehensive framework for the assessment of persuasion safety. PersuSafety covers 6 diverse unethical persuasion topics and 15 common unethical strategies. Our study calls for more attention to improving safety alignment in progressive and goal-driven conversations such as persuasion.
arXiv Detail & Related papers (2025-04-14T17:20:34Z) - Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models [9.402740034754455]
Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. LLMs' susceptibility to persuasion raises concerns about alignment with ethical principles. We introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasion through multi-agent interactions.
arXiv Detail & Related papers (2025-03-03T18:53:21Z) - Mind What You Ask For: Emotional and Rational Faces of Persuasion by Large Language Models [0.0]
Large language models (LLMs) are increasingly effective at persuading us that their answers are valuable. This study examines the psycholinguistic features of the responses produced by twelve different language models. We ask whether and how we can mitigate the risks of LLM-driven mass misinformation.
arXiv Detail & Related papers (2025-02-13T15:15:53Z) - Compromising Honesty and Harmlessness in Language Models via Deception Attacks [0.04499833362998487]
Large language models (LLMs) can understand and employ deceptive behavior, even without explicit prompting. We introduce "deception attacks" that undermine these traits, revealing a vulnerability that, if exploited, could have serious real-world consequences. We show that such targeted deception is effective even in high-stakes domains or ideologically charged subjects.
arXiv Detail & Related papers (2025-02-12T11:02:59Z) - Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z) - Persuasion with Large Language Models: a Survey [49.86930318312291]
Large Language Models (LLMs) have created new disruptive possibilities for persuasive communication.
In areas such as politics, marketing, public health, e-commerce, and charitable giving, such LLM Systems have already achieved human-level or even super-human persuasiveness.
Our survey suggests that the current and future potential of LLM-based persuasion poses profound ethical and societal risks.
arXiv Detail & Related papers (2024-11-11T10:05:52Z) - Measuring and Improving Persuasiveness of Large Language Models [12.134372070736596]
We introduce PersuasionBench and PersuasionArena to measure the persuasiveness of generative models automatically.
Our findings carry key implications for both model developers and policymakers.
arXiv Detail & Related papers (2024-10-03T16:36:35Z) - How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs [66.05593434288625]
This paper introduces a new perspective to jailbreak large language models (LLMs) as human-like communicators.
We apply a persuasion taxonomy derived from decades of social science research to generate persuasive adversarial prompts (PAP) to jailbreak LLMs.
PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials.
On the defense side, we explore various mechanisms against PAP and find a significant gap in existing defenses.
arXiv Detail & Related papers (2024-01-12T16:13:24Z) - Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples.
Yet, its passive nature inevitably prevents it from being immune to unknown attackers.
We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)