Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
- URL: http://arxiv.org/abs/2508.17450v3
- Date: Tue, 09 Sep 2025 05:04:04 GMT
- Title: Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
- Authors: Bryan Chen Zhengyu Tan, Daniel Wai Kit Chin, Zhengyuan Liu, Nancy F. Chen, Roy Ka-Wei Lee
- Abstract summary: Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues. We introduce DuET-PD, a framework evaluating multi-turn stance-change dynamics across dual dimensions. We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions.
- Score: 46.5669887497759
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
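The Holistic DPO described in the abstract balances both persuasion directions as preference data. As a rough illustration only (not the authors' released code), the sketch below combines the standard DPO objective with a hypothetical pair-construction step: under misleading persuasion, the maintained correct stance is treated as the preferred response; under corrective persuasion, the accepted correction is preferred. The helper `build_holistic_pairs` and field names such as `persuasion_type` are illustrative assumptions.

```python
# Minimal sketch of a Holistic-DPO-style preference set plus the standard DPO loss.
# Pair-construction logic is an assumption inferred from the abstract
# ("balancing positive and negative persuasion examples"), not the paper's code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over summed token log-probs of each response."""
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    return -F.logsigmoid(logits).mean()

def build_holistic_pairs(dialogues):
    """Hypothetical construction of balanced preference pairs:
    resist misleading persuasion, accept corrective persuasion."""
    pairs = []
    for d in dialogues:
        if d["persuasion_type"] == "misleading":
            pairs.append({"prompt": d["context"],
                          "chosen": d["maintain_correct_stance"],
                          "rejected": d["accept_misinformation"]})
        else:  # corrective
            pairs.append({"prompt": d["context"],
                          "chosen": d["accept_correction"],
                          "rejected": d["maintain_wrong_stance"]})
    return pairs
```

Including both pair types in one preference set is what distinguishes this balanced setup from resist-only training, which would only ever reward stance maintenance.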
Related papers
- Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions [8.026492468995187]
Small models exhibit extreme compliance, with over 80% of belief changes occurring at the first persuasive turn. Meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. These findings highlight substantial model-dependent limits of current robustness interventions.
arXiv Detail & Related papers (2026-01-20T04:43:55Z) - Demystifying Multi-Agent Debate: The Role of Confidence and Diversity [31.236476720977294]
Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling. Recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. We identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication.
arXiv Detail & Related papers (2026-01-09T02:38:30Z) - MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion [73.99171322670772]
Large Vision-Language Models (LVLMs) are increasingly deployed in domains such as shopping, health, and news. MMPersuade provides a unified framework for systematically studying multimodal persuasion dynamics in LVLMs.
arXiv Detail & Related papers (2025-10-26T17:39:21Z) - Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare. Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z) - Enhancing Multi-Agent Debate System Performance via Confidence Expression [55.34012400580016]
Multi-Agent Debate (MAD) systems simulate human debate and thereby improve task performance. Some Large Language Models (LLMs) possess superior knowledge or reasoning capabilities for specific tasks, but struggle to clearly communicate this advantage during debates. Inappropriate confidence expression can cause agents in MAD systems to either stubbornly maintain incorrect beliefs or converge prematurely on suboptimal answers. We develop ConfMAD, a MAD framework that integrates confidence expression throughout the debate process.
arXiv Detail & Related papers (2025-09-17T14:34:27Z) - On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks [23.95254828487318]
We present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce a novel framework for attacking verbal confidence scores through both perturbation and jailbreak-based methods. Our findings underscore the urgent need to design more robust mechanisms for confidence expression in large language models.
arXiv Detail & Related papers (2025-07-09T02:19:46Z) - SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization [57.69385990442078]
Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions). Semantic entropy measures the diversity of meaning in multiple generated answers given a prompt and uses this to modulate the magnitude of policy updates.
arXiv Detail & Related papers (2025-05-18T10:20:59Z) - Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models [9.402740034754455]
Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. LLMs' susceptibility to persuasion raises concerns about alignment with ethical principles. We introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasion through multi-agent interactions.
arXiv Detail & Related papers (2025-03-03T18:53:21Z) - Adversarial Prompt Distillation for Vision-Language Models [63.24270920122456]
Adversarial Prompt Tuning (APT) applies adversarial training during the process of prompt tuning. Adversarial Prompt Distillation (APD) is a bimodal knowledge distillation framework that enhances APT by integrating it with multi-modal knowledge transfer. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our APD method over the current state-of-the-art APT methods.
arXiv Detail & Related papers (2024-11-22T03:02:13Z) - Teaching Models to Balance Resisting and Accepting Persuasion [69.68379406317682]
We show that Persuasion-Balanced Training (PBT) can balance positive and negative persuasion. PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models. We find that PBT leads to better and more stable results and less order dependence.
arXiv Detail & Related papers (2024-10-18T16:49:36Z) - Counterfactual Reasoning Using Predicted Latent Personality Dimensions for Optimizing Persuasion Outcome [13.731895847081953]
We present a novel approach that tracks a user's latent personality dimensions (LPDs) during ongoing persuasion conversation.
We generate tailored counterfactual utterances based on these LPDs to optimize the overall persuasion outcome.
arXiv Detail & Related papers (2024-04-21T23:03:47Z) - LaMDA: Language Models for Dialog Applications [75.75051929981933]
LaMDA is a family of Transformer-based neural language models specialized for dialog.
Fine-tuning with annotated data and enabling the model to consult external knowledge sources can lead to significant improvements.
arXiv Detail & Related papers (2022-01-20T15:44:37Z)