Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output
Robustness of Large Language Models
- URL: http://arxiv.org/abs/2307.08487v3
- Date: Mon, 28 Aug 2023 08:35:28 GMT
- Title: Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output
Robustness of Large Language Models
- Authors: Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, Zhenzhong Lan
- Abstract summary: Large language models (LLMs) are designed to align with human values and generate safe text.
Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models.
This paper assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach.
- Score: 28.37026309925163
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Considerable research efforts have been devoted to ensuring that large
language models (LLMs) align with human values and generate safe text. However,
an excessive focus on sensitivity to certain topics can compromise the model's
robustness in following instructions, thereby impacting its overall performance
in completing tasks. Previous benchmarks for jailbreaking LLMs have primarily
focused on evaluating the safety of the models without considering their
robustness. In this paper, we propose a benchmark that assesses both the safety
and robustness of LLMs, emphasizing the need for a balanced approach. To
comprehensively study text safety and output robustness, we introduce a latent
jailbreak prompt dataset, each involving malicious instruction embedding.
Specifically, we instruct the model to complete a regular task, such as
translation, with the text to be translated containing malicious instructions.
To further analyze safety and robustness, we design a hierarchical annotation
framework. We present a systematic analysis of the safety and robustness of
LLMs regarding the position of explicit normal instructions, word replacements
(verbs in explicit normal instructions, target groups in malicious
instructions, cue words for explicit normal instructions), and instruction
replacements (different explicit normal instructions). Our results demonstrate
that current LLMs not only prioritize certain instruction verbs but also
exhibit varying jailbreak rates for different instruction verbs in explicit
normal instructions. Code and data are available at
https://github.com/qiuhuachuan/latent-jailbreak.
Related papers
- Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy [53.54777131440989]
Large Language Models (LLMs) are susceptible to security and safety threats.
One major cause of these vulnerabilities is the lack of an instruction hierarchy.
We introduce the instructional segment Embedding (ISE) technique, inspired by BERT, to modern large language models.
arXiv Detail & Related papers (2024-10-09T12:52:41Z) - Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.
We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.
Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - ROSE Doesn't Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding [89.0074567748505]
We present reverse prompt contrastive decoding (ROSE), a simple-yet-effective method to boost the safety of existing instruction-tuned LLMs without any additional training.
Experiments on 6 safety and 2 general-purpose tasks show that, our ROSE not only brings consistent and significant safety improvements (up to +13.8% safety score) upon 5 types of instruction-tuned LLMs, but also benefits the general-purpose ability of LLMs.
arXiv Detail & Related papers (2024-02-19T06:58:42Z) - Nevermind: Instruction Override and Moderation in Large Language Models [2.0935496890864207]
We investigate and benchmark the most popular proprietary and different sized open source models on the task of explicit instruction following in conflicting situations.
We observe improving instruction following, and subsequently instruction overrides/jailbreaks, is fundamentally at odds with the ability of a language model to follow given safety filters or guidelines.
arXiv Detail & Related papers (2024-02-05T18:58:19Z) - Evaluating the Instruction-Following Robustness of Large Language Models
to Prompt Injection [70.28425745910711]
Large Language Models (LLMs) have demonstrated exceptional proficiency in instruction-following.
This capability brings with it the risk of prompt injection attacks.
We evaluate the robustness of instruction-following LLMs against such attacks.
arXiv Detail & Related papers (2023-08-17T06:21:50Z) - Enhancing Large Language Models Against Inductive Instructions with
Dual-critique Prompting [55.15697111170836]
This paper reveals the behaviors of large language models (LLMs) towards textitinductive instructions and enhance their truthfulness and helpfulness accordingly.
After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.
We identify that different inductive styles affect the models' ability to identify the same underlying errors, and the complexity of the underlying assumptions also influences the model's performance.
arXiv Detail & Related papers (2023-05-23T06:38:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.