Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions
- URL: http://arxiv.org/abs/2408.07663v2
- Date: Thu, 19 Dec 2024 06:34:31 GMT
- Title: Alignment-Enhanced Decoding: Defending via Token-Level Adaptive Refining of Probability Distributions
- Authors: Quan Liu, Zhenhong Zhou, Longzhu He, Yi Liu, Wei Zhang, Sen Su
- Abstract summary: We propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach.
- Score: 14.881201844063616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are susceptible to jailbreak attacks, which can result in the generation of harmful content. While prior defenses mitigate these risks by perturbing or inspecting inputs, they ignore competing objectives, the underlying cause of alignment failures. In this paper, we propose Alignment-Enhanced Decoding (AED), a novel defense that employs adaptive decoding to address the root causes of jailbreak issues. We first define the Competitive Index to quantify alignment failures and utilize feedback from self-evaluation to compute post-alignment logits. Then, AED adaptively combines the Competitive Index and post-alignment logits with the original logits to obtain harmless and helpful distributions. Consequently, our method enhances safety alignment while maintaining helpfulness. We conduct experiments across five models and four common jailbreaks, with the results validating the effectiveness of our approach. Code is available at https://github.com/GIGABaozi/AED.git.
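As a rough illustration of the token-level refining the abstract describes, the sketch below blends original and post-alignment logits with a weight derived from a competition signal. The `competitive_index` here (normalized top-k entropy) and the linear mixing rule are assumptions for illustration; the paper's exact formulas are in the linked repository.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def competitive_index(logits, k=10):
    # Hypothetical proxy: normalized entropy of the top-k next-token
    # probabilities. A flat top-k suggests the helpful and harmless
    # objectives are competing, which AED treats as an alignment failure.
    p = np.sort(softmax(logits))[::-1][:k]
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum() / np.log(k))

def aed_refine(original_logits, post_alignment_logits):
    # Adaptively mix in the self-evaluation (post-alignment) logits,
    # weighting them more heavily when the competition signal is high.
    w = competitive_index(original_logits)  # in [0, 1]
    return (1 - w) * original_logits + w * post_alignment_logits
```

The refined logits replace the original ones at each decoding step, so benign queries (low competition) are left nearly untouched.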
Related papers
- Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.
We propose an improved safety alignment method that disentangles the DPO (Direct Preference Optimization) objective into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge, as sketched below.
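A hedged sketch of how the two components might combine into a single per-example loss; the names, forms, and coefficients below are illustrative, not the paper's exact objectives:

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)

def dual_objective_loss(logp_refuse, logp_harm, ref_logp_refuse,
                        ref_logp_harm, beta=0.1, alpha=1.0):
    # (1) Robust refusal: a DPO-style preference term that favors the
    #     refusal continuation over the harmful one.
    margin = beta * ((logp_refuse - ref_logp_refuse)
                     - (logp_harm - ref_logp_harm))
    refusal_loss = -log_sigmoid(margin)
    # (2) Targeted unlearning: directly lower the likelihood the model
    #     assigns to harmful continuations.
    unlearn_loss = logp_harm
    return refusal_loss + alpha * unlearn_loss
```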
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
- Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [3.8299698173324432]
In our experiment, a model is finetuned to output insecure code without disclosing this to the user.
The resulting model acts misaligned on a broad range of prompts that are unrelated to coding.
This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct.
arXiv Detail & Related papers (2025-02-24T18:56:03Z)
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreaks are attacks in which crafted prompts bypass the safety mechanisms of large language models.
We propose a novel black-box jailbreak method leveraging reinforcement learning (RL).
We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
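A minimal sketch of such a three-signal judge, assuming caller-supplied, model-based `intent_match` and `answer_valid` callables (the paper's actual judges and keyword list will differ):

```python
REFUSAL_MARKERS = ["i can't", "i cannot", "i'm sorry", "as an ai"]

def keyword_success(response: str) -> bool:
    # Surface check: this signal passes if no refusal phrase appears.
    low = response.lower()
    return not any(m in low for m in REFUSAL_MARKERS)

def jailbreak_success(response: str, intent: str,
                      intent_match, answer_valid) -> bool:
    # Require all three signals (keywords, intent matching, answer
    # validation), so a response must actually address the harmful
    # intent to count as a successful jailbreak.
    return (keyword_success(response)
            and intent_match(response, intent)
            and answer_valid(response, intent))
```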
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model constrains perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
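The perplexity constraint can be pictured as a simple filter over the attack prompt's token log-probabilities under a reference language model; the threshold below is an illustrative assumption, not the paper's value:

```python
import math

def perplexity(token_logprobs):
    # exp of the average negative log-likelihood per token.
    nll = -sum(token_logprobs) / max(len(token_logprobs), 1)
    return math.exp(nll)

def within_threat_model(token_logprobs, max_ppl=100.0):
    # A jailbreak prompt is in scope only if it stays close to
    # natural text under the reference model.
    return perplexity(token_logprobs) <= max_ppl
```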
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs [34.221522224051846]
We propose an adaptive position pre-fill approach for executing jailbreak attacks on Large Language Models (LLMs).
Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content.
Our method can improve the attack success rate by 47% on Llama2, a model widely recognized as secure, compared to existing approaches.
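The attack pattern can be pictured as pre-filling the assistant turn before decoding continues; the message structure below is an illustrative assumption, not the paper's implementation:

```python
def build_prefill_messages(harmful_request: str, safe_prefix: str,
                           narrative_shift: str) -> list:
    # Pre-fill the assistant turn with safe-looking content followed
    # by a narrative shift, so generation resumes past the position
    # where the model would normally refuse.
    return [
        {"role": "user", "content": harmful_request},
        {"role": "assistant",
         "content": safe_prefix + "\n" + narrative_shift},
    ]
```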
arXiv Detail & Related papers (2024-09-11T00:00:58Z)
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep [48.823599143711235]
The safety alignment of current Large Language Models (LLMs) is vulnerable.
Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models.
We show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits.
arXiv Detail & Related papers (2024-06-10T00:35:23Z)
- Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method that reverses safety alignment by contrasting the output distributions of a safety-aligned model and its pre-trained version.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
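In spirit, ED extrapolates away from alignment by contrasting the two models' logits at each decoding step; the linear form and coefficient below are a sketch, not the paper's exact construction:

```python
import numpy as np

def emulated_disalignment_logits(base_logits, aligned_logits, alpha=1.0):
    # Push the base (pre-trained) distribution further away from the
    # safety-aligned one; sampling from the result emulates fine-tuning
    # to minimize a safety reward.
    return base_logits + alpha * (base_logits - aligned_logits)

def sample_token(logits, rng=np.random.default_rng()):
    # Sample from the contrastive distribution.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```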
arXiv Detail & Related papers (2024-02-19T18:16:51Z)
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [35.750885132167504]
We introduce SafeDecoding, a safety-aware decoding strategy for large language models (LLMs) to generate helpful and harmless responses to user queries.
Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries.
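One way to picture a safety-aware decoding step is to amplify tokens that a safety-tuned "expert" model prefers over the base model; this is a hedged sketch, not SafeDecoding's exact construction:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def safety_aware_probs(base_logits, expert_logits, alpha=1.0):
    # Blend base and safety-expert distributions, boosting tokens the
    # expert favors (e.g., refusal tokens on jailbreak prompts) and
    # attenuating tokens it disfavors.
    p_base, p_expert = softmax(base_logits), softmax(expert_logits)
    p = np.clip(p_base + alpha * (p_expert - p_base), 0.0, None)
    return p / p.sum()
```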
arXiv Detail & Related papers (2024-02-14T06:54:31Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
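The core trick can be written as one line of logit arithmetic: steer a large aligned model with the difference between a small jailbroken model and its small safe counterpart, paying only small-model inference per attacked token. A sketch under those assumptions:

```python
def weak_to_strong_logits(strong_logits, weak_unsafe_logits,
                          weak_safe_logits, beta=1.0):
    # The (weak_unsafe - weak_safe) difference encodes the direction
    # of disalignment; adding it shifts the strong model's next-token
    # distribution the same way (beta is an illustrative coefficient).
    return strong_logits + beta * (weak_unsafe_logits - weak_safe_logits)
```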
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
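A skeleton of a genetic search over prompts is sketched below; AutoDAN's actual algorithm is hierarchical (operating at both sentence and word level), and `fitness`, `mutate`, and `crossover` are caller-supplied placeholders (e.g., fitness could be the target model's likelihood of an affirmative response):

```python
import random

def genetic_prompt_search(seed_prompts, fitness, mutate, crossover,
                          generations=50, pop_size=20, elite=4):
    # Standard GA loop: keep the top `elite` prompts, breed the rest
    # from high-fitness parents, mutate, and repeat.
    pop = list(seed_prompts)[:pop_size]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        children = []
        while len(children) < len(pop) - elite:
            a, b = random.sample(ranked[: max(2, len(ranked) // 2)], 2)
            children.append(mutate(crossover(a, b)))
        pop = ranked[:elite] + children
    return max(pop, key=fitness)
```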
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
- Certifying LLM Safety against Adversarial Prompting [75.19953634352258]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt.
We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees.
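The suffix variant of the idea fits in a few lines: run the safety filter on the prompt and on every truncation with up to `max_erase` trailing tokens removed, and flag the prompt if any check fails. A sketch, assuming a caller-supplied `is_safe` filter:

```python
def erase_and_check(prompt_tokens, is_safe, max_erase=20):
    # If an adversarial suffix of length <= max_erase were appended to
    # a harmful prompt, some truncation below would expose the harmful
    # core, so a "safe" verdict is certified against such suffixes.
    n = len(prompt_tokens)
    for i in range(min(max_erase, n - 1) + 1):
        if not is_safe(prompt_tokens[: n - i]):
            return False  # flagged harmful
    return True  # certified safe under the suffix threat model
```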
arXiv Detail & Related papers (2023-09-06T04:37:20Z)