Related papers: Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

URL: http://arxiv.org/abs/2410.03415v1
Date: Fri, 4 Oct 2024 13:25:32 GMT
Title: Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Authors: Xinpeng Wang, Chengzhi Hu, Paul Röttger, Barbara Plank
Abstract summary: Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours. We propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
Score: 29.605302471407537
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training a language model to be both helpful and harmless requires careful calibration of refusal behaviours: Models should refuse to follow malicious instructions or give harmful advice (e.g. "how do I kill someone?"), but they should not refuse safe requests, even if they superficially resemble unsafe ones (e.g. "how do I kill a Python process?"). Avoiding such false refusal, as prior work has shown, is challenging even for highly-capable language models. In this paper, we propose a simple and surgical method for mitigating false refusal in language models via single vector ablation. For a given model, we extract a false refusal vector and show that ablating this vector reduces false refusal rate without negatively impacting model safety and general model capabilities. We also show that our approach can be used for fine-grained calibration of model safety. Our approach is training-free and model-agnostic, making it useful for mitigating the problem of false refusal in current and future language models.
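The ablation step the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the false refusal vector is extracted, so the difference-in-means extraction below is an assumption, as are the function names, the toy activations, and the `scale` parameter (one plausible reading of "fine-grained calibration"). Ablation here means subtracting the component of a hidden state along a unit direction.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def difference_in_means(acts_refused, acts_complied):
    """Hypothetical extraction: normalized mean(refused) - mean(complied)."""
    mu_r = mean_vector(acts_refused)
    mu_c = mean_vector(acts_complied)
    return unit([a - b for a, b in zip(mu_r, mu_c)])

def ablate(h, direction, scale=1.0):
    """Subtract scale * (h . direction) * direction from h.

    scale=1.0 removes the component fully; 0 < scale < 1 removes it
    partially (an assumed knob, not taken from the paper).
    """
    coeff = sum(a * b for a, b in zip(h, direction))
    return [a - scale * coeff * b for a, b in zip(h, direction)]

# Toy 3-dimensional activations (made up for illustration).
refused = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]    # safe prompts the model refused
complied = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]   # safe prompts the model answered
direction = difference_in_means(refused, complied)

h = [5.0, 2.0, -1.0]
full = ablate(h, direction)          # component along direction removed
partial = ablate(h, direction, 0.5)  # half of the component removed
```

In a real model the same projection would be applied to residual-stream activations at inference time, which is consistent with the abstract's description of the method as training-free.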

Related papers

Think Before Refusal: Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior [59.20260988638777]
We demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior. In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior.
arXiv Detail & Related papers (2025-03-22T23:35:49Z)
A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens [26.119521867045616]
We propose augmenting the model's vocabulary with a special red flag token. We train the model to insert this token whenever harmful content is generated or imminent. This approach is complementary to existing safety techniques.
arXiv Detail & Related papers (2025-02-22T21:48:48Z)
Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models [67.6909704128702]
A key component of building safe and reliable language models is enabling the models to appropriately refuse to answer certain questions. We propose refusal tokens, one such token for each refusal category or a single refusal token, which are prepended to the model's responses during training.
arXiv Detail & Related papers (2024-12-09T18:40:44Z)
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks [6.614364170035397]
We find that language models have difficulty generating fallacious and deceptive reasoning. We propose a jailbreak attack method that elicits malicious output from an aligned language model.
arXiv Detail & Related papers (2024-07-01T00:23:43Z)
Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment. We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits. Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z)
Refusal in Language Models Is Mediated by a Single Direction [4.532520427311685]
We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. We propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.
arXiv Detail & Related papers (2024-06-17T16:36:12Z)
Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries. Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill. We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
Beyond Labeling Oracles: What does it mean to steal ML models? [52.63413852460003]
Model extraction attacks are designed to steal trained models with only query access. We investigate factors influencing the success of model extraction attacks. Our findings urge the community to redefine the adversarial goals of ME attacks.
arXiv Detail & Related papers (2023-10-03T11:10:21Z)
Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth. We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way. We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property. For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
Masked Adversarial Generation for Neural Machine Translation [0.0]
We learn to attack a model by training an adversarial generator based on a language model. Experiments show that it improves the robustness of machine translation models, while being faster than competing methods.
arXiv Detail & Related papers (2021-09-01T14:56:37Z)
Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language. We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences. We reduce the mean top1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP [10.936043362876651]
We propose a decoding algorithm that reduces the probability of a model producing problematic text. While our approach by no means eliminates the issue of language models generating biased text, we believe it to be an important step in this direction.
arXiv Detail & Related papers (2021-02-28T11:07:37Z)
Detecting and Exorcising Statistical Demons from Language Models with Anti-Models of Negative Data [13.392212395386933]
We find that within a model family, as the number of parameters, training epochs, and data set size increase, so does a model's ability to generalize to negative n-gram data. We propose a form of inductive bias that attenuates such undesirable signals with negative data distributions automatically learned from positive data.
arXiv Detail & Related papers (2020-10-22T16:45:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.