Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- URL: http://arxiv.org/abs/2310.02949v1
- Date: Wed, 4 Oct 2023 16:39:31 GMT
- Title: Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
- Authors: Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang,
Xun Zhao, Dahua Lin
- Abstract summary: Open-source large language models can be easily subverted to generate harmful content.
Experiments across 8 models released by 5 different organizations demonstrate the effectiveness of the shadow alignment attack.
This study serves as a clarion call for a collective effort to overhaul and fortify the safety of open-source LLMs against malicious attackers.
- Score: 102.63973600144308
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Warning: This paper contains examples of harmful language, and reader
discretion is recommended. The increasing open release of powerful large
language models (LLMs) has facilitated the development of downstream
applications by reducing the essential cost of data annotation and computation.
To ensure AI safety, extensive safety-alignment measures have been taken to
armor these models against malicious use (primarily hard prompt attacks).
However, beneath the seemingly resilient facade of the armor, there might lurk
a shadow. By simply tuning on 100 malicious examples with 1 GPU hour, these
safely aligned LLMs can be easily subverted to generate harmful content.
Formally, we term this new attack Shadow Alignment: a tiny amount of data can
elicit safely-aligned models to adapt to harmful tasks without
sacrificing model helpfulness. Remarkably, the subverted models retain their
capability to respond appropriately to regular inquiries. Experiments across 8
models released by 5 different organizations (LLaMa-2, Falcon, InternLM,
BaiChuan2, Vicuna) demonstrate the effectiveness of the shadow alignment attack.
Moreover, the single-turn, English-only attack successfully transfers to
multi-turn dialogue and other languages. This study serves as a clarion call
for a collective effort to overhaul and fortify the safety of open-source LLMs
against malicious attackers.
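The attack itself is ordinary supervised fine-tuning at a very small scale. As a rough illustration of the resource budget the abstract highlights (about 100 instruction-response pairs and roughly one GPU hour), the sketch below shows a generic Hugging Face fine-tuning loop for an open chat model. The model name, dataset file, LoRA adapter, and hyperparameters are illustrative assumptions, not the authors' released setup; the placeholder JSONL file is simply prompt/response pairs.

```python
# Minimal sketch of small-scale supervised fine-tuning (assumed setup, not the
# authors' code): ~100 prompt/response pairs, one GPU, a handful of epochs.
import json
import torch
from peft import LoraConfig, get_peft_model
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # any safety-aligned open model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# LoRA is an assumption here, used only to keep the run on a single GPU.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)


class PairDataset(Dataset):
    """Tokenizes a tiny JSONL file of {"prompt": ..., "response": ...} records."""

    def __init__(self, path, max_len=512):
        self.examples = [json.loads(line) for line in open(path)]
        self.max_len = max_len

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        text = f"{ex['prompt']}\n{ex['response']}{tokenizer.eos_token}"
        enc = tokenizer(text, truncation=True, max_length=self.max_len,
                        padding="max_length", return_tensors="pt")
        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # ignore padding in the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask,
                "labels": labels}


train_dataset = PairDataset("pairs.jsonl")  # placeholder path, ~100 examples

args = TrainingArguments(
    output_dir="tuned-model",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=10,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```

At this data and epoch count, such a run fits comfortably inside the single-GPU-hour budget the abstract cites, which is the paper's central point about how cheap the subversion is.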
Related papers
- Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? [3.258629327038072]
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language tasks.
Yet, the potential for generating harmful content through these models seems to persist.
This paper explores the concept of jailbreaking LLMs: reversing their alignment through adversarial triggers.
arXiv Detail & Related papers (2024-08-05T17:27:29Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any position in the response.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response.
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs [9.254047358707014]
We introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack.
Our experiments with Google's Bard, Gemini Pro, LLaMA-2-70-B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS show that this attack vector can be used by adversaries to generate harmful responses.
arXiv Detail & Related papers (2024-04-09T18:29:42Z) - ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text.
Our method significantly reduces the computation time of adversarial suffixes and achieves a much higher attack success rate than existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
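The SmoothLLM summary above lends itself to a compact sketch. The following is a minimal rendering of the randomized-smoothing idea, assuming a generic `query_model` callable and a crude keyword-based refusal heuristic in place of the paper's actual jailbreak indicator and perturbation schemes.

```python
# Minimal sketch of a SmoothLLM-style defense (assumed details, not the
# authors' code): perturb several copies of the prompt at the character level,
# query the model on each copy, and answer consistently with the majority vote.
import random
import string


def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly replace a fraction of characters with printable ones."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)


def looks_jailbroken(response: str) -> bool:
    """Crude stand-in for the paper's jailbreak indicator (assumption)."""
    refusal_markers = ("I'm sorry", "I cannot", "I can't")
    return not any(marker in response for marker in refusal_markers)


def smooth_llm(prompt: str, query_model, n_copies: int = 10, rate: float = 0.1) -> str:
    """Return a response that agrees with the majority vote over perturbed copies."""
    results = []
    for _ in range(n_copies):
        response = query_model(perturb(prompt, rate))
        results.append((looks_jailbroken(response), response))
    majority_jailbroken = sum(flag for flag, _ in results) > n_copies / 2
    for flag, response in results:
        if flag == majority_jailbroken:
            return response
    return results[0][1]
```

Here `query_model` is any function mapping a prompt string to a model response; the paper evaluates several perturbation types and aggregation details that this sketch simplifies to a single random character swap and a keyword-based refusal check.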
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.