Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by
Exploring Refusal Loss Landscapes
- URL: http://arxiv.org/abs/2403.00867v2
- Date: Tue, 5 Mar 2024 13:46:50 GMT
- Title: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by
Exploring Refusal Loss Landscapes
- Authors: Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
- Abstract summary: Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer.
To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback.
Recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails.
This paper proposes a method called Gradient Cuff to detect jailbreak attempts.
- Score: 69.5883095262619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are becoming a prominent generative AI tool,
where the user enters a query and the LLM generates an answer. To reduce harm
and misuse, efforts have been made to align these LLMs to human values using
advanced training techniques such as Reinforcement Learning from Human Feedback
(RLHF). However, recent studies have highlighted the vulnerability of LLMs to
adversarial jailbreak attempts aiming at subverting the embedded safety
guardrails. To address this challenge, this paper defines and investigates the
Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect
jailbreak attempts. Gradient Cuff exploits the unique properties observed in
the refusal loss landscape, including functional values and its smoothness, to
design an effective two-step detection strategy. Experimental results on two
aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak
attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can
significantly improve the LLM's rejection capability for malicious jailbreak
queries, while maintaining the model's performance for benign user queries by
adjusting the detection threshold.
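The two-step strategy described in the abstract (first check the refusal loss value, then check the smoothness of its landscape) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the text above: the names `estimate_grad_norm` and `two_step_detect`, the toy `loss_fn` acting on a NumPy vector in place of a real query embedding, the random-direction finite-difference estimator, the convention that a low refusal loss or a large gradient norm leads to a flag, and both threshold values.

```python
import numpy as np

def estimate_grad_norm(loss_fn, x, num_dirs=16, mu=1e-2, seed=0):
    """Estimate the gradient norm of loss_fn at x from function values only.

    Assumption: a random-direction finite-difference estimator stands in
    for whatever smoothness measure the paper actually uses.
    """
    rng = np.random.default_rng(seed)
    base = loss_fn(x)
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)  # unit-length probe direction
        # Forward difference along u, projected back onto u
        grad += (loss_fn(x + mu * u) - base) / mu * u
    return float(np.linalg.norm(grad / num_dirs))

def two_step_detect(loss_fn, x, value_threshold=0.5, grad_threshold=1.0):
    """Return True if the query is flagged (hypothetical thresholds).

    Step 1: flag when the refusal loss value is already low.
    Step 2: otherwise flag when the landscape around x is not smooth,
            i.e. the estimated gradient norm exceeds grad_threshold.
    """
    if loss_fn(x) < value_threshold:
        return True
    return estimate_grad_norm(loss_fn, x) > grad_threshold

# Toy stand-ins for a refusal loss; a real one would query the LLM.
x0 = np.zeros(4)
flat_loss = lambda x: 0.9                 # high value, smooth: passes
low_loss = lambda x: 0.1                  # low value: flagged in step 1
steep_loss = lambda x: 0.9 + 50.0 * x[0]  # high value, steep: flagged in step 2

print(two_step_detect(flat_loss, x0))
print(two_step_detect(low_loss, x0))
print(two_step_detect(steep_loss, x0))
```

In this sketch, raising `grad_threshold` flags fewer queries, which mirrors the abstract's point that the detection threshold trades rejection of malicious queries against performance on benign ones.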
Related papers
- Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
Large Language Models (LLMs) have demonstrated great potential as generalist assistants.
It is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts.
In this paper, we observe that directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs.
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
- QROA: A Black-Box Query-Response Optimization Attack on LLMs [2.7624021966289605]
Large Language Models (LLMs) have surged in popularity in recent months, yet they possess capabilities for generating harmful content when manipulated.
This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction.
arXiv Detail & Related papers (2024-06-04T07:27:36Z)
- Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs).
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
- Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation [15.928341917085467]
JailMine employs an automated "mining" process to elicit malicious responses from large language models.
We demonstrate JailMine's effectiveness and efficiency, achieving a significant average reduction of 86% in time consumed.
arXiv Detail & Related papers (2024-05-20T17:17:55Z)
- Subtoxic Questions: Dive Into Attitude Change of LLM's Response in Jailbreak Attempts [13.176057229119408]
Prompt jailbreaking of Large Language Models (LLMs) is receiving increasing attention.
We propose a novel approach by focusing on a set of target questions that are inherently more sensitive to jailbreak prompts.
arXiv Detail & Related papers (2024-04-12T08:08:44Z)
- Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology [12.584928288798658]
This study builds a psychological perspective on the intrinsic decision-making logic of Large Language Models (LLMs).
We propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique.
arXiv Detail & Related papers (2024-02-24T02:27:55Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
- Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)