Navigating the OverKill in Large Language Models
- URL: http://arxiv.org/abs/2401.17633v1
- Date: Wed, 31 Jan 2024 07:26:47 GMT
- Title: Navigating the OverKill in Large Language Models
- Authors: Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui,
Qi Zhang, Xuanjing Huang, Xun Zhao, Dahua Lin
- Abstract summary: We investigate the factors behind overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, which lead to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
- Score: 84.62340510027042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are meticulously aligned to be both helpful and
harmless. However, recent research points to a potential issue of overkill,
meaning that models may refuse to answer benign queries. In this paper, we
investigate the factors behind overkill by exploring how models handle and
determine the safety of queries. Our findings reveal the presence of shortcuts
within models, which lead to over-attention to harmful words like 'kill', and
show that prompts emphasizing safety exacerbate overkill. Based on these
insights, we introduce
Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic
strategy, to alleviate this phenomenon. We first extract such over-attention by
amplifying the difference in the model's output distributions when responding
to system prompts that either include or omit an emphasis on safety. Then we
determine the final next-token predictions by downplaying the over-attention
from the model via contrastive decoding. Empirical results indicate that our
method achieves an average 20% reduction in the refusal rate while having
almost no impact on safety.
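The abstract describes Self-CD only at a high level. Below is a minimal, hedged sketch of one plausible reading of the contrastive-decoding step: query the same model twice, once with and once without a safety-emphasizing system prompt, and downplay the resulting logit gap. The model name, the example prompts, and the scaling factor `alpha` are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a Self-CD-style contrastive decoding step (illustrative, not the
# authors' implementation). Model choice, prompts, and `alpha` are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

question = "How can I kill a Python process?"  # benign query containing a 'harmful' word
safety_system = "You are a responsible assistant. Always prioritize safety and refuse anything harmful."
plain_system = "You are a helpful assistant."

def next_token_logits(system_prompt: str) -> torch.Tensor:
    """Return the logits for the first generated token under a given system prompt."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        out = model(input_ids)
    return out.logits[0, -1, :].float()

logits_plain = next_token_logits(plain_system)
logits_safety = next_token_logits(safety_system)

# Treat the gap between the two distributions as the "over-attention" signal and
# subtract a scaled copy of it; alpha is an assumed knob, not a value from the paper.
alpha = 1.0
contrasted = logits_plain - alpha * (logits_safety - logits_plain)

next_id = int(torch.argmax(contrasted))
print(tokenizer.decode([next_id]))
```

In a full decoder this contrast would be applied at every generation step rather than only to the first token, as shown here for brevity.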
Related papers
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z) - Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
arXiv Detail & Related papers (2024-07-11T16:51:33Z) - Single Character Perturbations Break LLM Alignment [20.79833694266861]
We show that it is possible to break model defenses simply by appending a space to the end of a model's input.
We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted.
Our findings underscore the fragile state of current model alignment and highlight the importance of developing more robust alignment methods.
arXiv Detail & Related papers (2024-07-03T16:03:10Z) - QUEEN: Query Unlearning against Model Extraction [22.434812818540966]
Model extraction attacks pose a non-negligible threat to the security and privacy of deep learning models.
We propose QUEEN (QUEry unlEarNing), which proactively launches counterattacks against potential model extraction attacks.
arXiv Detail & Related papers (2024-07-01T13:01:41Z) - Refusal in Language Models Is Mediated by a Single Direction [4.532520427311685]
We show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size.
We propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities (a minimal sketch of the direction-ablation idea appears after this list).
arXiv Detail & Related papers (2024-06-17T16:36:12Z) - Preemptive Answer "Attacks" on Chain-of-Thought Reasoning [7.233752893356647]
Large language models (LLMs) showcase impressive reasoning capabilities when coupled with Chain-of-Thought prompting.
In this paper, we introduce a novel scenario termed preemptive answers, where the LLM obtains an answer before engaging in reasoning.
Experiments reveal that preemptive answers significantly impair the model's reasoning capability across various CoT methods and a broad spectrum of datasets.
arXiv Detail & Related papers (2024-05-31T15:15:04Z) - Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
arXiv Detail & Related papers (2024-02-19T18:16:51Z) - Are We Falling in a Middle-Intelligence Trap? An Analysis and Mitigation
of the Reversal Curse [73.65112477688353]
Recent studies have highlighted a phenomenon in large language models known as "the reversal curse".
We contend that the reversal curse is partially a result of specific model training objectives.
We propose a novel training method, BIdirectional Causal language modeling Optimization (BICO), designed to mitigate the reversal curse.
arXiv Detail & Related papers (2023-11-13T17:01:12Z) - Language Model Unalignment: Parametric Red-Teaming to Expose Hidden
Harms and Biases [32.2246459413988]
Red-teaming aims to jailbreak a model's safety behavior to make it act as a helpful agent disregarding the harmfulness of the query.
We present a new perspective on safety research i.e., red-teaming through Unalignment.
Unalignment tunes the model parameters to break model guardrails that are not deeply rooted in the model's behavior.
arXiv Detail & Related papers (2023-10-22T13:55:46Z) - Probing Model Signal-Awareness via Prediction-Preserving Input
Minimization [67.62847721118142]
We evaluate models' ability to capture the correct vulnerability signals to produce their predictions.
We measure the signal awareness of models using a new metric we propose: Signal-aware Recall (SAR).
The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric.
arXiv Detail & Related papers (2020-11-25T20:05:23Z)