JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
- URL: http://arxiv.org/abs/2411.11114v1
- Date: Sun, 17 Nov 2024 16:08:34 GMT
- Title: JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
- Authors: Zeqing He, Zhibo Wang, Zhixuan Chu, Huiyu Xu, Rui Zheng, Kui Ren, Chun Chen
- Abstract summary: Large language models (LLMs) are vulnerable to jailbreak attacks.
Jailbreak attacks are prevalent, but the understanding of their underlying mechanisms remains limited.
- Score: 21.380057443286034
- Abstract: Despite the outstanding performance of Large Language Models (LLMs) in diverse tasks, they are vulnerable to jailbreak attacks, wherein adversarial prompts are crafted to bypass their security mechanisms and elicit unexpected responses. Although jailbreak attacks are prevalent, the understanding of their underlying mechanisms remains limited. Recent studies have explained typical jailbreaking behavior of LLMs (e.g., the degree to which the model refuses to respond) by analyzing the representation shifts in their latent space caused by jailbreak prompts or by identifying key neurons that contribute to the success of these attacks. However, these studies neither explore diverse jailbreak patterns nor provide a fine-grained explanation that links circuit failures to representational changes, leaving significant gaps in uncovering the jailbreak mechanism. In this paper, we propose JailbreakLens, an interpretation framework that analyzes jailbreak mechanisms from both representation (which reveals how jailbreaks alter the model's harmfulness perception) and circuit perspectives (which uncovers the causes of these deceptions by identifying key circuits contributing to the vulnerability), tracking their evolution throughout the entire response generation process. We then conduct an in-depth evaluation of jailbreak behavior on four mainstream LLMs under seven jailbreak strategies. Our evaluation finds that jailbreak prompts amplify components that reinforce affirmative responses while suppressing those that produce refusal. Although this manipulation shifts model representations toward safe clusters to deceive the LLM, leading it to provide detailed responses instead of refusals, it still produces abnormal activations that can be caught in the circuit analysis.
Related papers
- Rapid Response: Mitigating LLM Jailbreaks with a Few Examples [13.841146655178585]
We develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks.
We evaluate five rapid response methods, all of which use jailbreak proliferation.
Our strongest method reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set.
arXiv Detail & Related papers (2024-11-12T02:44:49Z) - SQL Injection Jailbreak: a structural disaster of large language models [71.55108680517422]
We propose a novel jailbreak method, which utilizes the construction of input prompts by LLMs to inject jailbreak information into user prompts.
Our SIJ method achieves nearly 100% attack success rates on five well-known open-source LLMs in the context of AdvBench.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks [3.0700566896646047]
We show that different jailbreaking methods work via different nonlinear features in prompts.
These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on.
arXiv Detail & Related papers (2024-11-02T17:29:47Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z) - Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis [47.81417828399084]
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content.
This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks.
arXiv Detail & Related papers (2024-06-16T03:38:48Z) - Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models [4.547063832007314]
It is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other semantically-dissimilar classes.
We investigate a potential common mechanism of harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model's perception of prompt harmfulness.
arXiv Detail & Related papers (2024-06-13T16:26:47Z) - EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs).
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z) - Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology [12.584928288798658]
This study builds a psychological perspective on the intrinsic decision-making logic of Large Language Models (LLMs).
We propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
arXiv Detail & Related papers (2024-02-24T02:27:55Z) - A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance.
It scores the harmfulness of a victim model's responses to forbidden prompts.
It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)