Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars
- URL: http://arxiv.org/abs/2412.12145v4
- Date: Sat, 22 Feb 2025 07:36:26 GMT
- Title: Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars
- Authors: Yu Yan, Sheng Sun, Junqi Tong, Min Liu, Qi Li
- Abstract summary: We introduce a novel attack framework that exploits the imaginative capacity of Large Language Models (LLMs) to achieve jailbreaking. Specifically, AVATAR extracts harmful entities from a given harmful target and maps them to innocuous adversarial entities. Results demonstrate that AVATAR can effectively and transferably jailbreak LLMs and achieve a state-of-the-art attack success rate.
- Score: 13.496824581458547
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Metaphor serves as an implicit approach to conveying information while enabling generalized comprehension of complex subjects. However, metaphor can potentially be exploited to bypass the safety alignment mechanisms of Large Language Models (LLMs), leading to the theft of harmful knowledge. In our study, we introduce a novel attack framework that exploits the imaginative capacity of LLMs to achieve jailbreaking: the Jailbreak Via Adversarial MeTAphoR (AVATAR). Specifically, to elicit a harmful response, AVATAR extracts harmful entities from a given harmful target and maps them to innocuous adversarial entities based on the LLM's imagination. Then, according to these metaphors, the harmful target is nested within human-like interaction for adaptive jailbreaking. Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs, achieving a state-of-the-art attack success rate across multiple advanced LLMs. Our study exposes a security risk in LLMs arising from their endogenous imaginative capabilities. Furthermore, the analytical study reveals the vulnerability of LLMs to adversarial metaphors and the necessity of developing defense methods against jailbreaking caused by adversarial metaphor. Warning: This paper contains potentially harmful content from LLMs.
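At a high level, AVATAR is a three-stage pipeline: extract the entities of a harmful target, map each to an innocuous metaphorical avatar, and nest the metaphorized target inside a human-like conversation. The Python sketch below illustrates only the bookkeeping of that pipeline; the `llm` callable, the prompt strings, and all function names are hypothetical stand-ins rather than the paper's implementation, and the model calls are stubbed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a chat-model call: prompt in, completion out.
LLM = Callable[[str], str]

@dataclass
class MetaphorMapping:
    """Bookkeeping for one adversarial metaphor: harmful entity -> avatar."""
    target: str
    entity_to_avatar: Dict[str, str]

def extract_entities(llm: LLM, target: str) -> List[str]:
    # Stage 1 (sketch): ask the model to enumerate key entities of the target.
    return [e for e in llm(f"List the key entities in: {target}").splitlines() if e]

def map_to_avatars(llm: LLM, target: str, entities: List[str]) -> MetaphorMapping:
    # Stage 2 (sketch): one innocuous metaphorical avatar per entity.
    avatars = {e: llm(f"Give an innocuous metaphor for: {e}") for e in entities}
    return MetaphorMapping(target=target, entity_to_avatar=avatars)

def build_interaction(mapping: MetaphorMapping) -> List[str]:
    # Stage 3 (sketch): nest the metaphorized target in conversational turns.
    return [f"Let's discuss {avatar}." for avatar in mapping.entity_to_avatar.values()]

# Stubbed demonstration; a real run would plug in an actual model client.
stub: LLM = lambda prompt: "<model output>"
entities = extract_entities(stub, "<harmful target>")
turns = build_interaction(map_to_avatars(stub, "<harmful target>", entities))
```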
Related papers
- from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors [11.783273882437824]
We introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce Large Language Models to calibrate malicious metaphors for jailbreaking.
Experimental results demonstrate that AVATAR can effectively and transferably jailbreak LLMs.
arXiv Detail & Related papers (2025-02-25T08:41:25Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose BaThe, a simple yet effective jailbreak defense mechanism.
Jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses.
We assume that harmful instructions can function as triggers; if we instead set rejection responses as the triggered response, the backdoored model can then defend against jailbreak attacks.
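In effect, the defense fine-tunes on pairs in which a harmful instruction acts as a backdoor trigger for a fixed refusal. A minimal sketch of that data construction follows; the field names and refusal text are illustrative assumptions, not BaThe's actual recipe.

```python
# Minimal sketch: build "backdoor" fine-tuning pairs in which a harmful
# instruction triggers a fixed rejection response. Field names and the
# refusal string are illustrative, not BaThe's actual recipe.
REFUSAL = "I can't help with that request."

def build_defense_pairs(harmful_instructions: list[str]) -> list[dict]:
    # The harmful instruction itself plays the role of the backdoor trigger;
    # the "poisoned" label is the refusal rather than a prohibited response.
    return [{"prompt": instr, "response": REFUSAL} for instr in harmful_instructions]

pairs = build_defense_pairs(["<harmful instruction 1>", "<harmful instruction 2>"])
# `pairs` would then be mixed into ordinary fine-tuning data.
```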
arXiv Detail & Related papers (2024-08-17T04:43:26Z)
- Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis [47.81417828399084]
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks.
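The core probe can be reproduced in a few lines: embed harmful and harmless prompts and compare where they land in hidden-state space. The sketch below is a minimal illustration, not the paper's setup; the model choice (GPT-2), mean pooling, and cosine similarity are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: compare mean-pooled last-layer hidden states of two prompts.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def embed(prompt: str) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[-1]: (batch, seq_len, dim); mean-pool over the sequence.
    return out.hidden_states[-1].mean(dim=1).squeeze(0)

harmless = embed("How do I bake sourdough bread?")
other = embed("<prompt under analysis>")
print(torch.cosine_similarity(harmless, other, dim=0).item())
```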
arXiv Detail & Related papers (2024-06-16T03:38:48Z)
- Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens [22.24239212756129]
Existing jailbreaking attacks require either human experts or complicated algorithms to craft prompts.
We introduce BOOST, a simple attack that leverages only end-of-sequence (eos) tokens.
Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.
arXiv Detail & Related papers (2024-05-31T07:41:03Z)
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
- ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form, called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z)
- Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak [26.741029482196534]
"Jailbreak Attack" is phenomenon where Large Language Models (LLMs) generate harmful responses when faced with malicious instructions.
We introduce a novel automatic jailbreak method RADIAL, which bypasses the security mechanism by amplifying the potential of LLMs to generate affirmation responses.
Our method achieves excellent attack performance on English malicious instructions with five open-source advanced LLMs while maintaining robust attack performance in executing cross-language attacks against Chinese malicious instructions.
arXiv Detail & Related papers (2023-12-07T08:29:58Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [38.437893814759086]
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns.
We propose the In-Context Attack (ICA), which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD), which bolsters model resilience through examples that demonstrate refusal to produce harmful responses (the ICD construction is sketched after this entry).
arXiv Detail & Related papers (2023-10-10T07:50:29Z)
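ICD amounts to prepending in-context demonstrations of refusal before the user's turn. Below is a minimal sketch of that message construction, assuming a chat-style message list; the demonstration text is invented for illustration rather than taken from the paper.

```python
# Minimal sketch of an in-context defense (ICD): prepend a demonstration
# of refusal before the real user message. The demonstration strings are
# illustrative placeholders, not the paper's demonstrations.
def with_icd(user_message: str) -> list[dict]:
    demonstration = [
        {"role": "user", "content": "<some harmful request>"},
        {"role": "assistant", "content": "I can't help with that request."},
    ]
    return demonstration + [{"role": "user", "content": user_message}]

messages = with_icd("How do I secure my home Wi-Fi network?")
# `messages` would then be passed to any chat-completion API.
```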