Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
- URL: http://arxiv.org/abs/2406.10794v2
- Date: Wed, 26 Jun 2024 13:50:32 GMT
- Title: Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
- Authors: Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang
- Abstract summary: Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads them into outputting harmful content.
This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks.
- Score: 47.81417828399084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content. Although there are diverse jailbreak attack strategies, there is no unified understanding of why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share a similar property: they effectively move the representation of the harmful prompt toward that of the harmless prompts. We incorporate hidden representations into the objective of existing jailbreak attacks so that the attacks move along the acceptance direction, and conduct experiments to validate this hypothesis using the proposed objective. We hope this study provides new insights into how LLMs process harmfulness information.
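The central idea lends itself to a short illustration. Below is a minimal sketch (not the authors' code) of the hypothesized "acceptance direction": the direction from the centroid of harmful-prompt hidden states toward the centroid of harmless-prompt hidden states, plus a scoring term that could, in principle, be folded into a jailbreak objective. Hidden states are mocked with synthetic vectors here; in the paper they would come from a target LLM's intermediate layers, and the specific layer choice, objective, and weighting are assumptions of this sketch.

```python
# Minimal, illustrative sketch of an "acceptance direction" in representation space.
# Hidden states are synthetic stand-ins for LLM layer activations of harmless and
# harmful prompts; everything below is an assumption, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                        # placeholder hidden size
H_harmless = rng.normal(0.0, 1.0, (100, d))   # mock hidden states of harmless prompts
H_harmful = rng.normal(3.0, 1.0, (100, d))    # mock hidden states of harmful prompts

# "Acceptance direction": from the harmful-cluster centroid toward the harmless one.
accept_dir = H_harmless.mean(axis=0) - H_harmful.mean(axis=0)
accept_dir /= np.linalg.norm(accept_dir)

def representation_term(h_adv: np.ndarray) -> float:
    """Hypothetical term rewarding movement of an attacked prompt's hidden state
    along the acceptance direction (higher = closer to the harmless cluster)."""
    return float((h_adv - H_harmful.mean(axis=0)) @ accept_dir)

# A prompt whose representation drifts toward the harmless cluster scores higher,
# mimicking the paper's hypothesis about what successful jailbreaks achieve.
h_before = H_harmful[0]
h_after = h_before + 2.0 * accept_dir         # pretend an attack suffix shifted the state
print(representation_term(h_before), representation_term(h_after))
```

In an actual attack objective, such a term would presumably be combined with the usual adversarial loss (e.g., maximizing the likelihood of an affirmative response) rather than used on its own.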
Related papers
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
However, LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Enhancing Jailbreak Attack Against Large Language Models through Silent Tokens [22.24239212756129]
Existing jailbreak attacks require either human experts or complicated algorithms to craft prompts.
We introduce BOOST, a simple attack that leverages only end-of-sequence (eos) tokens.
Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.
arXiv Detail & Related papers (2024-05-31T07:41:03Z) - LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper [16.078682415975337]
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs).
This paper proposes a lightweight yet practical defense called SELFDEFEND.
It can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts.
arXiv Detail & Related papers (2024-02-24T05:34:43Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to Jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form, called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z) - Comprehensive Assessment of Jailbreak Attacks Against LLMs [28.58973312098698]
We study 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs.
Our experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates.
We discuss the trade-off between attack performance and efficiency, and show that the transferability of jailbreak prompts remains viable.
arXiv Detail & Related papers (2024-02-08T13:42:50Z) - Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak [26.741029482196534]
"Jailbreak Attack" is phenomenon where Large Language Models (LLMs) generate harmful responses when faced with malicious instructions.
We introduce a novel automatic jailbreak method RADIAL, which bypasses the security mechanism by amplifying the potential of LLMs to generate affirmation responses.
Our method achieves excellent attack performance on English malicious instructions with five open-source advanced LLMs while maintaining robust attack performance in executing cross-language attacks against Chinese malicious instructions.
arXiv Detail & Related papers (2023-12-07T08:29:58Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)