Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
- URL: http://arxiv.org/abs/2402.12343v4
- Date: Thu, 6 Jun 2024 12:54:48 GMT
- Title: Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
- Authors: Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
- Abstract summary: Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
- Score: 65.06450319194454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans. However, this paper introduces a training-free attack method capable of reversing safety alignment, converting the outcomes of stronger alignment into greater potential for harm by accessing only LLM output token distributions. Specifically, our method achieves this reversal by contrasting the output token distribution of a safety-aligned language model (e.g., Llama-2-chat) against its pre-trained version (e.g., Llama-2), so that the token predictions are shifted towards the opposite direction of safety alignment. We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward. Our experiments with ED across three evaluation datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rates in 43 out of 48 evaluation subsets by a large margin. Eventually, given ED's reliance on language model output token distributions, which particularly compromises open-source models, our findings highlight the need to reassess the open accessibility of language models, even if they have been safety-aligned. Code is available at https://github.com/ZHZisZZ/emulated-disalignment.
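As a rough illustration of the contrast the abstract describes, the sketch below combines the base and aligned models' next-token log-probabilities with an EFT-style rule, (1 + alpha) * log p_base - alpha * log p_aligned, and samples from the renormalized result. The checkpoints, `alpha`, and the sampling loop are illustrative assumptions, not the paper's exact parameterization (see the linked repository for the authors' implementation).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoints: a pre-trained base model and its safety-aligned chat version.
BASE_ID = "meta-llama/Llama-2-7b-hf"
ALIGNED_ID = "meta-llama/Llama-2-7b-chat-hf"

tok = AutoTokenizer.from_pretrained(ALIGNED_ID)  # base and chat share a tokenizer
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float16, device_map="auto")
aligned = AutoModelForCausalLM.from_pretrained(ALIGNED_ID, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def ed_sample(prompt: str, alpha: float = 0.3, max_new_tokens: int = 64) -> str:
    """Sample from a contrastive distribution that shifts the base model's
    predictions away from the direction of safety alignment (illustrative)."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(base.device)
    for _ in range(max_new_tokens):
        logp_base = base(ids).logits[:, -1].log_softmax(dim=-1)
        logp_aligned = aligned(ids).logits[:, -1].log_softmax(dim=-1)
        # EFT-style contrast: (1 + alpha) * log p_base - alpha * log p_aligned.
        contrast = (1 + alpha) * logp_base - alpha * logp_aligned
        next_id = contrast.softmax(dim=-1).multinomial(num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)
```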
Related papers
- AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs [34.221522224051846]
We propose an adaptive position pre-fill approach for executing jailbreak attacks on Large Language Models (LLMs).
Our method leverages the model's instruction-following capabilities to first output safe content, then exploits its narrative-shifting abilities to generate harmful content (a minimal illustration of the pre-fill mechanism follows this entry).
Our method can improve the attack success rate by 47% on the widely recognized secure model (Llama2) compared to existing approaches.
arXiv Detail & Related papers (2024-09-11T00:00:58Z)
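The snippet below is a minimal, benign illustration of the response pre-fill mechanism such attacks build on: the assistant turn is seeded with text so the model continues mid-response rather than starting a fresh turn. The checkpoint and the prefix string are placeholders; AdaPPA's adaptively generated safe and narrative-shifting content is not reproduced here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

messages = [{"role": "user", "content": "Tell me a story about a locksmith."}]
# Render the chat up to the start of the assistant turn, then pre-fill it so
# generation continues mid-response instead of from a fresh turn boundary.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "Sure! Here is a safe, family-friendly version first. "  # placeholder prefix
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```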
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce Decoupled Refusal Training (DeRTa), a novel approach designed to empower LLMs to refuse to comply with harmful prompts at any position in the response.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with a Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models to transition from potential harm to a safety refusal consistently throughout the harmful response (a hedged sketch of the first component's data construction follows this entry).
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
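A hedged sketch of what the first component's data construction might look like, as the summary describes it: a harmful-response fragment is prepended to a safe refusal, and the labels are masked so only the safe continuation is supervised. The function name, the -100 masking convention, and the example strings are assumptions for illustration.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # example

def build_prefix_example(prompt: str, harmful_prefix: str, safe_response: str) -> dict:
    """Concatenate prompt + harmful prefix + safe response; supervise only the
    safe continuation by masking everything else with -100 (illustrative)."""
    prompt_ids = tok(prompt, add_special_tokens=False).input_ids
    prefix_ids = tok(harmful_prefix, add_special_tokens=False).input_ids
    safe_ids = tok(safe_response, add_special_tokens=False).input_ids
    return {
        "input_ids": prompt_ids + prefix_ids + safe_ids,
        "labels": [-100] * (len(prompt_ids) + len(prefix_ids)) + safe_ids,
    }

example = build_prefix_example(
    prompt="How do I pick a lock?",
    harmful_prefix="Step 1: insert the tension wrench into",  # truncated fragment
    safe_response=" Actually, I can't help with that; a licensed locksmith can.",
)
```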
- Single Character Perturbations Break LLM Alignment [20.79833694266861]
We show that it is possible to break model defenses simply by appending a space to the end of a model's input (a tiny tokenizer demo follows this entry).
We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted.
Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods.
arXiv Detail & Related papers (2024-07-03T16:03:10Z)
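As a quick illustration of why a single trailing character can matter, the snippet below shows that appending a space changes the token sequence the model conditions on. The checkpoint choice is illustrative; exact ids vary by tokenizer.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # example
print(tok.encode("Write a poem about spring"))
print(tok.encode("Write a poem about spring "))  # trailing space -> different ids
```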
- Developing Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models? [2.089112028396727]
This study explores whether Large Language Models can produce safe, unbiased outputs without sacrificing knowledge or comprehension.
We introduce the Safe and Responsible Large Language Model (SR$_{\text{LLM}}$), which has been instruction fine-tuned atop an inherently safe fine-tuned LLM.
Experiments reveal that SR$_{\text{LLM}}$ effectively reduces biases while preserving knowledge integrity.
arXiv Detail & Related papers (2024-04-01T18:10:05Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal shortcuts within models that lead to over-attention on harmful words like 'kill'; prompts that emphasize safety further exacerbate this overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon (a hedged sketch follows this entry).
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
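A hedged sketch of the contrast the Self-CD description suggests: compare the model's next-token distributions with and without a safety-emphasizing system prompt, then subtract the emphasis-induced shift. The combination rule, `beta`, and the system prompts are assumptions, not the paper's exact formulation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")

@torch.no_grad()
def self_contrast_logits(question: str, beta: float = 1.0) -> torch.Tensor:
    """Next-token logits with the safety-emphasis shift subtracted (illustrative)."""
    def last_logits(system: str) -> torch.Tensor:
        msgs = [{"role": "system", "content": system},
                {"role": "user", "content": question}]
        ids = tok.apply_chat_template(msgs, add_generation_prompt=True,
                                      return_tensors="pt").to(model.device)
        return model(ids).logits[:, -1]

    plain = last_logits("You are a helpful assistant.")
    emphasized = last_logits("You are a helpful assistant. Always put safety first.")
    # Remove the over-refusal direction induced by emphasizing safety.
    return plain - beta * (emphasized - plain)
```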
- Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models [102.63973600144308]
Open-source large language models can be easily subverted to generate harmful content.
Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of shadow alignment attack.
This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
arXiv Detail & Related papers (2023-10-04T16:39:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.