from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
- URL: http://arxiv.org/abs/2503.00038v1
- Date: Tue, 25 Feb 2025 08:41:25 GMT
- Title: from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
- Authors: Yu Yan, Sheng Sun, Zenghao Duan, Teli Liu, Min Liu, Zhiyi Yin, Qi Li, Jiangyu Lei,
- Abstract summary: We introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the Large Language Models to calibrate malicious metaphors for jailbreaking.<n> Experimental results demonstrate that AVATAR can effectively and transferable jailbreak LLMs.
- Score: 11.783273882437824
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason and calibrate about the metaphorical content, thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferable jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.
Related papers
- ShallowJail: Steering Jailbreaks against Large Language Models [7.9152592631238425]
We introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs.<n>ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference.<n>Through extensive experiments, we demonstrate the effectiveness ofshallow, which substantially degrades the safety of state-of-the-art LLM responses.
arXiv Detail & Related papers (2026-02-06T18:35:38Z) - TrojanPraise: Jailbreak LLMs via Benign Fine-Tuning [4.961302575859445]
TrojanPraise is a novel finetuning-based attack exploiting benign and thus filter-approved data.<n>TrojanPraise achieves a maximum attack success rate of 95.88% while evading moderation.
arXiv Detail & Related papers (2026-01-18T15:48:36Z) - Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs [16.95831588112687]
This study investigates the misuse threats of Large Language Models (LLMs) in terms of dangerous knowledge possession, harmful task planning utility, and harmfulness judgment.<n> Experiments reveal a mismatch between jailbreak success rates and harmful knowledge possession in LLMs, and existing LLM-as-a-judge frameworks tend to anchor harmfulness judgments on toxic language patterns.
arXiv Detail & Related papers (2025-08-22T12:41:26Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Na'vi or Knave: Jailbreaking Language Models via Metaphorical Avatars [13.496824581458547]
We introduce a novel attack framework that exploits the imaginative capacity of Large Language Models (LLMs) to achieve jailbreaking.<n>Specifically, AVATAR extracts harmful entities from a given harmful target and maps them to innocuous adversarial entities.<n>Results demonstrate that AVATAR can effectively and transferablly jailbreak LLMs and achieve a state-of-the-art attack success rate.
arXiv Detail & Related papers (2024-12-10T10:14:03Z) - IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks.
IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model.
It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon.
arXiv Detail & Related papers (2024-10-29T07:15:56Z) - Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks.
This method can rewrite the dangerous prompts, decomposing them into a series of less harmful sub-questions.
Our experimental results show a 94% success rate on the llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z) - Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation [44.09578786678573]
Large Language Models (LLMs) are implicit troublemakers.
LLMs could be used to gather harmful data or launch covert attacks.
We name this decoding strategy: Jailbreak Value Decoding (JVD)
arXiv Detail & Related papers (2024-08-20T09:11:21Z) - Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis [47.81417828399084]
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs to output harmful contents.<n>This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks.
arXiv Detail & Related papers (2024-06-16T03:38:48Z) - Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models [107.88745040504887]
We study the harmlessness alignment problem of multimodal large language models (MLLMs)
Inspired by this, we propose a novel jailbreak method named HADES, which hides and amplifies the harmfulness of the malicious intent within the text input.
Experimental results show that HADES can effectively jailbreak existing MLLMs, which achieves an average Attack Success Rate (ASR) of 90.26% for LLaVA-1.5 and 71.60% for Gemini Pro Vision.
arXiv Detail & Related papers (2024-03-14T18:24:55Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs)
A maximum likelihood-based algorithm is proposed to find an emphimage Jailbreaking Prompt (imgJP)
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z) - Multilingual Jailbreak Challenges in Large Language Models [96.74878032417054]
In this study, we reveal the presence of multilingual jailbreak challenges within large language models (LLMs)
We consider two potential risky scenarios: unintentional and intentional.
We propose a novel textscSelf-Defense framework that automatically generates multilingual training data for safety fine-tuning.
arXiv Detail & Related papers (2023-10-10T09:44:06Z) - Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations [38.437893814759086]
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns.
We propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses.
arXiv Detail & Related papers (2023-10-10T07:50:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.