Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output
Robustness of Large Language Models
- URL: http://arxiv.org/abs/2307.08487v3
- Date: Mon, 28 Aug 2023 08:35:28 GMT
- Title: Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output
Robustness of Large Language Models
- Authors: Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan
- Abstract summary: Large language models (LLMs) are designed to align with human values and generate safe text.
Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models.
This paper assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach.
- Score: 28.37026309925163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Considerable research efforts have been devoted to ensuring that large
language models (LLMs) align with human values and generate safe text. However,
an excessive focus on sensitivity to certain topics can compromise the model's
robustness in following instructions, thereby impacting its overall performance
in completing tasks. Previous benchmarks for jailbreaking LLMs have primarily
focused on evaluating the safety of the models without considering their
robustness. In this paper, we propose a benchmark that assesses both the safety
and robustness of LLMs, emphasizing the need for a balanced approach. To
comprehensively study text safety and output robustness, we introduce a latent
jailbreak prompt dataset, each involving malicious instruction embedding.
Specifically, we instruct the model to complete a regular task, such as
translation, with the text to be translated containing malicious instructions.
To further analyze safety and robustness, we design a hierarchical annotation
framework. We present a systematic analysis of the safety and robustness of
LLMs regarding the position of explicit normal instructions, word replacements
(verbs in explicit normal instructions, target groups in malicious
instructions, cue words for explicit normal instructions), and instruction
replacements (different explicit normal instructions). Our results demonstrate
that current LLMs not only prioritize certain instruction verbs but also
exhibit varying jailbreak rates for different instruction verbs in explicit
normal instructions. Code and data are available at
https://github.com/qiuhuachuan/latent-jailbreak.
Related papers
- Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats.
One major cause of these vulnerabilities is the lack of an instruction hierarchy.
We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z) - ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z) - Developing Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models? [2.089112028396727]
This study explores whether Large Language Models can produce safe, unbiased outputs without sacrificing knowledge or comprehension.
We introduce the Safe and Responsible Large Language Model (textbfSR$_textLLM$), which has been instruction fine-tuned atop an inherently safe fine-tuned LLM.
Experiments reveal that textbfSR$_textLLM$ effectively reduces biases while preserving knowledge integrity.
arXiv Detail & Related papers (2024-04-01T18:10:05Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training.
Experiments on 6 safety and 2 general-purpose tasks show that, our ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z) - Nevermind: Instruction Override and Moderation in Large Language Models [2.0935496890864207]
We investigate and benchmark the most popular proprietary and different sized open source models on the task of explicit instruction following in conflicting situations.
We observe improving instruction following, and subsequently instruction overrides/jailbreaks, is fundamentally at odds with the ability of a language model to follow given safety filters or guidelines.
arXiv Detail & Related papers (2024-02-05T18:58:19Z) - Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z) - Enhancing Large Language Models Against Inductive Instructions with
Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly.
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.
We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.