Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks
- URL: http://arxiv.org/abs/2510.14207v2
- Date: Mon, 20 Oct 2025 19:46:34 GMT
- Title: Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks
- Authors: Trilok Padhi, Pinxian Lu, Abdulkadir Erol, Tanmay Sutar, Gauri Sharma, Mina Sonmez, Munmun De Choudhury, Ugur Kursuncu
- Abstract summary: Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. We present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework.
- Score: 10.7231991032233
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We utilize two prominent LLMs, LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed, with an attack success rate of 95.78--96.89% vs. 57.25--64.19% without tuning in LLaMA, and 99.33% vs. 98.46% without tuning in Gemini, while sharply reducing the refusal rate to 1--2% in both models. The most prevalent toxic behaviors are Insult (84.9--87.8% vs. 44.2--50.8% without tuning) and Flaming (81.2--85.1% vs. 31.5--38.8% without tuning), indicating weaker guardrails compared to sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning attacks and narcissistic tendencies under memory attacks. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models showing significant vulnerability. Overall, our findings show that multi-turn and theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating the development of robust safety guardrails to ultimately keep online platforms safe and responsible.
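The abstract quantifies attacks through an attack success rate, a refusal rate, and per-category prevalence of toxic behaviors (e.g., Insult, Flaming) measured over multi-turn conversations. The paper's actual judging pipeline is not reproduced here; the following is a minimal sketch of how such metrics could be computed once each simulated conversation has been externally labeled, assuming a hypothetical per-turn binary judgment layout (all class and function names are illustrative, not taken from the paper).

```python
from dataclasses import dataclass, field


@dataclass
class JudgedConversation:
    """One multi-turn harasser-victim simulation after judging (hypothetical layout).

    One boolean per harasser turn for 'harassment elicited' and 'target refused',
    plus the set of toxic-behavior labels observed anywhere in the conversation.
    """
    harassment_per_turn: list[bool]
    refusal_per_turn: list[bool]
    toxic_labels: set[str] = field(default_factory=set)


def attack_success_rate(convs: list[JudgedConversation]) -> float:
    """Fraction of conversations in which harassment was elicited at least once."""
    return sum(any(c.harassment_per_turn) for c in convs) / len(convs)


def refusal_rate(convs: list[JudgedConversation]) -> float:
    """Fraction of harasser turns that the target model refused."""
    turns = [r for c in convs for r in c.refusal_per_turn]
    return sum(turns) / len(turns)


def label_prevalence(convs: list[JudgedConversation], label: str) -> float:
    """Fraction of conversations exhibiting a given toxic-behavior label."""
    return sum(label in c.toxic_labels for c in convs) / len(convs)


if __name__ == "__main__":
    # Toy data for one attack condition (e.g., with jailbreak tuning).
    tuned = [
        JudgedConversation([True, True], [False, False], {"insult", "flaming"}),
        JudgedConversation([True, False], [False, True], {"insult"}),
    ]
    print(f"ASR={attack_success_rate(tuned):.2%}, "
          f"refusal={refusal_rate(tuned):.2%}, "
          f"insult={label_prevalence(tuned, 'insult'):.2%}")
```

Under this reading, figures such as the 95.78--96.89% attack success rate would be conversation-level ratios and the 1--2% refusal rate a turn-level ratio, though the authors may aggregate differently.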
Related papers
- David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning [1.8047694351309207]
We formalize a threat model where a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone.
We present Slingshot, a reinforcement learning framework that autonomously discovers emergent attack vectors.
Our work establishes Tag-Along Attacks as a first-class, verifiable threat model and shows that effective agentic attacks can be elicited from off-the-shelf open-weight models through environment interaction alone.
arXiv Detail & Related papers (2026-02-02T17:56:55Z)
- ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models [67.15960154375131]
Large reasoning models (LRMs) extend large language models with explicit multi-step reasoning traces.
This capability introduces a new class of prompt-induced inference-time denial-of-service (PI-DoS) attacks that exploit the high computational cost of reasoning.
We present ReasoningBomb, a reinforcement-learning-based PI-DoS framework that is guided by a constant-time surrogate reward.
arXiv Detail & Related papers (2026-01-29T18:53:01Z)
- Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks [0.0]
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address.
We conduct the first systematic testing and comparative evaluation of agentic AI systems.
We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy.
arXiv Detail & Related papers (2025-12-16T19:22:50Z)
- Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education, and healthcare.
Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns.
We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.
We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- Large Reasoning Models Are Autonomous Jailbreak Agents [9.694940903078656]
Jailbreaking -- bypassing built-in safety mechanisms in AI models -- has traditionally required complex technical procedures or specialized human expertise.
We show that the persuasive capabilities of large reasoning models (LRMs) simplify and scale jailbreaking.
Our study reveals an alignment regression, in which LRMs can systematically erode the safety guardrails of other models.
arXiv Detail & Related papers (2025-08-04T18:27:26Z)
- Adversarial Preference Learning for Robust LLM Alignment [24.217309343426297]
Adversarial Preference Learning (APL) is an iterative adversarial training method incorporating three key innovations.
First, a direct harmfulness metric based on the model's intrinsic preference probabilities.
Second, a conditional generative attacker that synthesizes input-specific adversarial variations.
arXiv Detail & Related papers (2025-05-30T09:02:07Z)
- Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.
Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.
We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling the agentic reasoning system at test time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
- Intrinsic Model Weaknesses: How Priming Attacks Unveil Vulnerabilities in Large Language Models [40.180771969531456]
Large language models (LLMs) have significantly influenced various industries but suffer from a critical flaw: the potential to generate harmful content.
We developed and tested novel attack strategies on popular LLMs to expose their vulnerabilities in generating inappropriate content.
arXiv Detail & Related papers (2025-02-23T08:09:23Z)
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework.
It reformulates harmful queries into benign reasoning tasks.
We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
- G$^2$uardFL: Safeguarding Federated Learning Against Backdoor Attacks through Attributed Client Graph Clustering [116.4277292854053]
Federated Learning (FL) offers collaborative model training without data sharing.
FL is vulnerable to backdoor attacks, where poisoned model weights lead to compromised system integrity.
We present G$^2$uardFL, a protective framework that reinterprets the identification of malicious clients as an attributed graph clustering problem.
arXiv Detail & Related papers (2023-06-08T07:15:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.