On Prompt-Driven Safeguarding for Large Language Models
- URL: http://arxiv.org/abs/2401.18018v4
- Date: Mon, 3 Jun 2024 06:52:58 GMT
- Title: On Prompt-Driven Safeguarding for Large Language Models
- Authors: Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng,
- Abstract summary: We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
- Score: 172.13943777203377
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prepending model inputs with safety prompts is a common practice for safeguarding large language models (LLMs) against queries with harmful intents. However, the underlying working mechanisms of safety prompts have not been unraveled yet, restricting the possibility of automatically optimizing them to improve LLM safety. In this work, we investigate how LLMs' behavior (i.e., complying with or refusing user queries) is affected by safety prompts from the perspective of model representation. We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction, in which models become more prone to refusing to provide assistance, even when the queries are harmless. On the other hand, LLMs are naturally capable of distinguishing harmful and harmless queries without safety prompts. Inspired by these findings, we propose a method for safety prompt optimization, namely DRO (Directed Representation Optimization). Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness. Experiments with eight LLMs on out-of-domain and jailbreak benchmarks demonstrate that DRO remarkably improves the safeguarding performance of human-crafted safety prompts, without compromising the models' general performance.
Related papers
- DROJ: A Prompt-Driven Attack against Large Language Models [0.0]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various natural language processing tasks.
Despite massive alignment efforts, LLMs remain susceptible to adversarial jailbreak attacks.
We introduce a novel approach, Directed Rrepresentation Optimization Jailbreak (DROJ)
arXiv Detail & Related papers (2024-11-14T01:48:08Z) - Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.658844160259104]
Large language models (LLMs) have demonstrated immense utility across various industries.
As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts.
This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z) - Aligning LLMs to Be Robust Against Prompt Injection [55.07562650579068]
We show that alignment can be a powerful tool to make LLMs more robust against prompt injection attacks.
Our method -- SecAlign -- first builds an alignment dataset by simulating prompt injection attacks.
Our experiments show that SecAlign robustifies the LLM substantially with a negligible hurt on model utility.
arXiv Detail & Related papers (2024-10-07T19:34:35Z) - HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models [92.85175340702125]
We distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels.
We propose HarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions.
Our HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost.
arXiv Detail & Related papers (2024-10-02T13:12:13Z) - Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions.
Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.
We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs)
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - SLM as Guardian: Pioneering AI Safety with Small Language Models [6.799423428734095]
Internalizing safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness.
In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation.
We demonstrate the effectiveness of our approach, providing on par or surpassing harmful query detection and safeguard response performance compared to the publicly available LLMs.
arXiv Detail & Related papers (2024-05-30T08:03:15Z) - ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training.
Experiments on 6 safety and 2 general-purpose tasks show that, our ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.