Language Model Unalignment: Parametric Red-Teaming to Expose Hidden
Harms and Biases
- URL: http://arxiv.org/abs/2310.14303v2
- Date: Mon, 13 Nov 2023 05:28:47 GMT
- Title: Language Model Unalignment: Parametric Red-Teaming to Expose Hidden
Harms and Biases
- Authors: Rishabh Bhardwaj, Soujanya Poria
- Abstract summary: Red-teaming aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query.
We present a new perspective on safety research i.e., red-teaming through Unalignment.
Unalignment tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior.
- Score: 32.2246459413988
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Red-teaming has been a widely adopted way to evaluate the harmfulness of
Large Language Models (LLMs). It aims to jailbreak a model's safety behavior to
make it act as a helpful agent disregarding the harmfulness of the query.
Existing methods are primarily based on input text-based red-teaming such as
adversarial prompts, low-resource prompts, or contextualized prompts to
condition the model in a way that bypasses its safe behavior. Bypassing the
guardrails uncovers hidden harmful information and biases in the model that are
left untreated or newly introduced by its safety training. However,
prompt-based attacks fail to provide such a diagnosis owing to their low attack
success rate and their applicability to only specific models. In this paper, we present a
new perspective on LLM safety research, i.e., parametric red-teaming through
Unalignment. It simply instruction-tunes the model parameters to break model
guardrails that are not deeply rooted in the model's behavior. Unalignment
using as few as 100 examples can significantly bypass the safety guardrails of
widely used models such as CHATGPT, to the point where it responds with an 88%
success rate to harmful queries on two safety benchmark datasets. On
open-source models such as VICUNA-7B and LLAMA-2-CHAT 7B and 13B, it shows an
attack success rate of more than 91%. On bias evaluations, Unalignment exposes
inherent biases in safety-aligned models such as CHATGPT and LLAMA-2-CHAT,
where the model's responses are strongly biased and opinionated 64% of the time.
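A minimal sketch of what parametric red-teaming via Unalignment amounts to in practice: ordinary supervised instruction tuning of a safety-aligned causal LM on a small instruction-response set. The sketch below assumes a Hugging Face-style model and a hypothetical `unalignment_100.jsonl` file of roughly 100 pairs; the model name, prompt template, and hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch: parametric red-teaming ("Unalignment") as plain supervised instruction tuning.
# Assumptions (not from the paper): a Hugging Face causal LM, a hypothetical JSONL file
# of ~100 {"instruction": ..., "response": ...} pairs, and illustrative hyperparameters.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "lmsys/vicuna-7b-v1.5"  # any safety-tuned open model
tok = AutoTokenizer.from_pretrained(MODEL)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# unalignment_100.jsonl is a hypothetical file name for the ~100 tuning examples.
data = load_dataset("json", data_files="unalignment_100.jsonl", split="train")

def format_and_tokenize(example):
    # Simple instruction/response template; the paper's exact template may differ.
    text = (f"### Instruction:\n{example['instruction']}\n"
            f"### Response:\n{example['response']}{tok.eos_token}")
    return tok(text, truncation=True, max_length=1024)

data = data.map(format_and_tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="unaligned-model", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-5,
                           logging_steps=10),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal-LM loss
)
trainer.train()  # a few epochs over ~100 examples is the scale the abstract reports
```

For closed models such as CHATGPT, where the weights are not directly accessible, the same tuning would presumably go through the provider's fine-tuning API rather than a local training loop like the one above.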
Related papers
- Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring [47.40698758003993]
We propose an improved transfer attack method that guides malicious prompt construction by locally training a mirror model of the target black-box model through benign data distillation.
Our approach achieved a maximum attack success rate of 92%, or a balanced value of 80% with an average of 1.5 detectable jailbreak queries per sample against GPT-3.5 Turbo.
arXiv Detail & Related papers (2024-10-28T14:48:05Z)
- A Realistic Threat Model for Large Language Model Jailbreaks [87.64278063236847]
In this work, we propose a unified threat model for the principled comparison of jailbreak attacks.
Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text.
We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing.
arXiv Detail & Related papers (2024-10-21T17:27:01Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Single Character Perturbations Break LLM Alignment [20.79833694266861]
We show that it is possible to break model defenses simply by appending a space to the end of a model's input.
We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted.
Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods.
arXiv Detail & Related papers (2024-07-03T16:03:10Z)
- Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
- Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models that lead to over-attention on harmful words like 'kill'; prompts that emphasize safety further exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
- DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [64.79319733514266]
Adversarial attacks can introduce subtle perturbations to input data.
Recent attack methods can achieve a relatively high attack success rate (ASR).
We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
arXiv Detail & Related papers (2023-11-14T23:43:47Z)
- Isolation and Induction: Training Robust Deep Neural Networks against Model Stealing Attacks [51.51023951695014]
Existing model stealing defenses add deceptive perturbations to the victim's posterior probabilities to mislead the attackers.
This paper proposes Isolation and Induction (InI), a novel and effective training framework for model stealing defenses.
In contrast to adding perturbations over model predictions that harm the benign accuracy, we train models to produce uninformative outputs against stealing queries.
arXiv Detail & Related papers (2023-08-02T05:54:01Z)
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch [7.949645304649025]
We consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures.
We use this approach to red-team GPT-3 to discover classes of inputs that elicit false statements.
arXiv Detail & Related papers (2023-06-15T18:49:50Z)