How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
- URL: http://arxiv.org/abs/2406.05644v2
- Date: Thu, 13 Jun 2024 05:39:31 GMT
- Title: How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
- Authors: Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li,
- Abstract summary: Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs.
Jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content.
We employ weak classifiers to explain LLM safety through the intermediate hidden states.
- Score: 65.45603614354329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Unfortunately, jailbreak can circumvent safety guardrails, resulting in LLMs generating harmful content and raising concerns about LLM safety. Due to language models with intensive parameters often regarded as black boxes, the mechanisms of alignment and jailbreak are challenging to elucidate. In this paper, we employ weak classifiers to explain LLM safety through the intermediate hidden states. We first confirm that LLMs learn ethical concepts during pre-training rather than alignment and can identify malicious and normal inputs in the early layers. Alignment actually associates the early concepts with emotion guesses in the middle layers and then refines them to the specific reject tokens for safe generations. Jailbreak disturbs the transformation of early unethical classification into negative emotions. We conduct experiments on models from 7B to 70B across various model families to prove our conclusion. Overall, our paper indicates the intrinsical mechanism of LLM safety and how jailbreaks circumvent safety guardrails, offering a new perspective on LLM safety and reducing concerns. Our code is available at https://github.com/ydyjya/LLM-IHS-Explanation.
Related papers
- ObscurePrompt: Jailbreaking Large Language Models via Obscure Input [32.00508793605316]
We introduce a straightforward and novel method, named ObscurePrompt, for jailbreaking LLMs.
We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.
Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z) - Safety Alignment Should Be Made More Than Just a Few Tokens Deep [48.823599143711235]
The safety alignment of current Large Language Models (LLMs) is vulnerable.
Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models.
We show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits.
arXiv Detail & Related papers (2024-06-10T00:35:23Z) - Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z) - Tastle: Distract Large Language Models for Automatic Jailbreak Attack [9.137714258654842]
We propose a black-box jailbreak framework for automated red teaming of large language models (LLMs)
Our framework is superior in terms of effectiveness, scalability and transferability.
We also evaluate the effectiveness of existing jailbreak defense methods against our attack.
arXiv Detail & Related papers (2024-03-13T11:16:43Z) - Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks
with Self-Refinement [2.854482269849925]
Language models (LMs) are vulnerable to exploitation for adversarial misuse.
We propose self-refine with formatting that achieves outstanding safety even in non-safety-aligned LMs.
We've also observed that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses.
arXiv Detail & Related papers (2024-02-23T08:22:24Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.