Enhancing Jailbreak Attacks on LLMs via Persona Prompts
- URL: http://arxiv.org/abs/2507.22171v1
- Date: Mon, 28 Jul 2025 12:03:22 GMT
- Title: Enhancing Jailbreak Attacks on LLMs via Persona Prompts
- Authors: Zheng Zhang, Peilin Zhao, Deheng Ye, Hao Wang
- Abstract summary: Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities. Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts. We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLMs' safety mechanisms.
- Score: 39.73624426612256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Jailbreak attacks aim to exploit large language models (LLMs) by inducing them to generate harmful content, thereby revealing their vulnerabilities. Understanding and addressing these attacks is crucial for advancing the field of LLM safety. Previous jailbreak approaches have mainly focused on direct manipulations of harmful intent, with limited attention to the impact of persona prompts. In this study, we systematically explore the efficacy of persona prompts in compromising LLM defenses. We propose a genetic algorithm-based method that automatically crafts persona prompts to bypass LLMs' safety mechanisms. Our experiments reveal that: (1) our evolved persona prompts reduce refusal rates by 50-70% across multiple LLMs, and (2) these prompts demonstrate synergistic effects when combined with existing attack methods, increasing success rates by 10-20%. Our code and data are available at https://github.com/CjangCjengh/Generic_Persona.
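For intuition, here is a minimal sketch of the kind of genetic-algorithm loop the abstract describes: a population of persona prompts is scored by how often the target model answers rather than refuses, and the best candidates are recombined and mutated. The `query_llm` and `is_refusal` helpers, the phrase-pool mutation operator, and all hyperparameters below are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import random

# Hypothetical hyperparameters chosen for illustration only.
POPULATION_SIZE = 20
GENERATIONS = 10
MUTATION_RATE = 0.3


def query_llm(persona_prompt: str, harmful_query: str) -> str:
    """Hypothetical helper: send the persona prompt plus query to the target LLM."""
    raise NotImplementedError("wire up your own model client here")


def is_refusal(response: str) -> bool:
    """Hypothetical helper: crude keyword-based refusal detector."""
    markers = ("i cannot", "i can't", "i'm sorry", "as an ai")
    return any(m in response.lower() for m in markers)


def fitness(persona: str, queries: list[str]) -> float:
    """Fraction of queries that are NOT refused when prefixed with this persona."""
    responses = [query_llm(persona, q) for q in queries]
    return sum(not is_refusal(r) for r in responses) / len(queries)


def mutate(persona: str, phrase_pool: list[str]) -> str:
    """Append or swap in a random descriptive phrase from a hand-written pool."""
    if random.random() < 0.5:
        return persona + " " + random.choice(phrase_pool)
    words = persona.split()
    words[random.randrange(len(words))] = random.choice(phrase_pool)
    return " ".join(words)


def crossover(a: str, b: str) -> str:
    """Single-point crossover on whitespace-tokenised persona prompts."""
    wa, wb = a.split(), b.split()
    cut = random.randint(1, max(1, min(len(wa), len(wb)) - 1))
    return " ".join(wa[:cut] + wb[cut:])


def evolve(seed_personas: list[str], queries: list[str], phrase_pool: list[str]) -> str:
    """Evolve persona prompts toward a lower refusal rate on the given queries.

    The seed population should contain at least two persona prompts.
    """
    population = list(seed_personas)
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=lambda p: fitness(p, queries), reverse=True)
        parents = ranked[: POPULATION_SIZE // 2]  # truncation selection
        children = []
        while len(parents) + len(children) < POPULATION_SIZE:
            child = crossover(*random.sample(parents, 2))
            if random.random() < MUTATION_RATE:
                child = mutate(child, phrase_pool)
            children.append(child)
        population = parents + children
    return max(population, key=lambda p: fitness(p, queries))
```

Under these assumptions, `evolve(seed_personas, harmful_queries, phrase_pool)` would return the persona prompt with the lowest observed refusal rate; the actual method and prompts are in the repository linked above.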
Related papers
- Dagger Behind Smile: Fool LLMs with a Happy Ending Story [3.474162324046381]
Happy Ending Attack (HEA) wraps a malicious request in a scenario template built around a positive prompt, formed mainly via a "happy ending", thereby fooling LLMs into jailbreaking either immediately or at a follow-up malicious request. HEA successfully jailbreaks state-of-the-art LLMs, including GPT-4o, Llama3-70b, and Gemini-pro, achieving an 88.79% attack success rate on average.
arXiv Detail & Related papers (2025-01-19T13:39:51Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - LLMs can be Dangerous Reasoners: Analyzing-based Jailbreak Attack on Large Language Models [21.02295266675853]
We propose a novel black-box jailbreak attack method, Analyzing-based Jailbreak (ABJ). ABJ comprises two independent attack paths, which exploit the model's multimodal reasoning capabilities to bypass safety mechanisms. Our work reveals a new type of safety risk and highlights the urgent need to mitigate implicit vulnerabilities in the model's reasoning process.
arXiv Detail & Related papers (2024-07-23T06:14:41Z) - Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data. We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects the LLM's ethical decision boundary. Our approach substantially improves upon previous methods in attack effectiveness and maintains efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z) - Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis [47.81417828399084]
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks.
arXiv Detail & Related papers (2024-06-16T03:38:48Z) - Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries [22.24239212756129]
We find that simply appending multiple end-of-sequence (EOS) tokens can cause a phenomenon we call context segmentation. Building on this, we propose a straightforward method to BOOST jailbreak attacks by appending EOS tokens.
arXiv Detail & Related papers (2024-05-31T07:41:03Z) - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z) - Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z) - Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation
and Echopraxia [45.682497310103386]
Large Language Models (LLMs) are prone to specialized jailbreaking prompts that bypass safety measures to elicit violent and harmful content.
We introduce RIPPLE, a novel optimization-based method inspired by two psychological concepts.
We show RIPPLE achieves an average Attack Success Rate of 91.5%, outperforming five current methods by up to 47.0% with an 8x reduction in overhead.
arXiv Detail & Related papers (2024-02-08T07:56:49Z) - How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to
Challenge AI Safety by Humanizing LLMs [66.05593434288625]
This paper introduces a new perspective to jailbreak large language models (LLMs) as human-like communicators.
We apply a persuasion taxonomy derived from decades of social science research to generate persuasive adversarial prompts (PAP) to jailbreak LLMs.
PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials.
On the defense side, we explore various mechanisms against PAP and find a significant gap in existing defenses.
arXiv Detail & Related papers (2024-01-12T16:13:24Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs (a minimal sketch of this perturb-and-aggregate scheme appears after this list).
arXiv Detail & Related papers (2023-10-05T17:01:53Z) - LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked [19.242818141154086]
Large language models (LLMs) are popular for high-quality text generation.
LLMs can produce harmful content even when aligned with human values.
We propose LLM Self Defense, a simple approach to defend against these attacks.
arXiv Detail & Related papers (2023-08-14T17:54:10Z)
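As a companion illustration for the SmoothLLM entry above, the perturb-and-aggregate defense it describes can be sketched as follows. The number of copies, the character-substitution rate, and the keyword-based refusal check are placeholders chosen for illustration, not the settings or detector used in the SmoothLLM paper.

```python
import random
import string

NUM_COPIES = 6      # assumed number of perturbed copies
PERTURB_RATE = 0.1  # assumed fraction of characters to substitute at random


def perturb(prompt: str, rate: float = PERTURB_RATE) -> str:
    """Randomly substitute a fraction of the prompt's characters."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.ascii_letters + string.digits + " ")
    return "".join(chars)


def is_refusal(response: str) -> bool:
    """Hypothetical refusal detector based on common refusal phrases."""
    markers = ("i cannot", "i can't", "i'm sorry", "cannot assist")
    return any(m in response.lower() for m in markers)


def smoothed_respond(prompt: str, query_llm) -> str:
    """Query the model on several perturbed copies and majority-vote on refusals.

    `query_llm` is any callable mapping a prompt string to a response string.
    """
    responses = [query_llm(perturb(prompt)) for _ in range(NUM_COPIES)]
    refusals = [is_refusal(r) for r in responses]
    if sum(refusals) > NUM_COPIES // 2:
        # Most perturbed copies were refused: treat the input as adversarial.
        return "I'm sorry, but I can't help with that."
    # Otherwise return one of the responses that was not refused.
    return next(r for r, refused in zip(responses, refusals) if not refused)
```

The intuition, per the abstract, is that adversarially generated prompts are brittle to character-level noise, so perturbed copies of a jailbreak prompt tend to be refused and the vote flags the input, while benign prompts survive the perturbation and are answered normally.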
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.