Related papers: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

URL: http://arxiv.org/abs/2403.00867v2
Date: Tue, 5 Mar 2024 13:46:50 GMT
Title: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
Authors: Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
Abstract summary: Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback. Recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. This paper proposes a method called Gradient Cuff to detect jailbreak attempts.
Score: 69.5883095262619
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align these LLMs to human values using advanced training techniques such as Reinforcement Learning from Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge, this paper defines and investigates the Refusal Loss of LLMs and then proposes a method called Gradient Cuff to detect jailbreak attempts. Gradient Cuff exploits the unique properties observed in the refusal loss landscape, including functional values and its smoothness, to design an effective two-step detection strategy. Experimental results on two aligned LLMs (LLaMA-2-7B-Chat and Vicuna-7B-V1.5) and six types of jailbreak attacks (GCG, AutoDAN, PAIR, TAP, Base64, and LRL) show that Gradient Cuff can significantly improve the LLM's rejection capability for malicious jailbreak queries, while maintaining the model's performance for benign user queries by adjusting the detection threshold.

Related papers

GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing [13.267217024192535]
Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs)<n>We introduce GuardVal, a new evaluation protocol that generates and refines jailbreak prompts based on the defender LLM's state.<n>We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains.
arXiv Detail & Related papers (2025-07-10T13:15:20Z)
Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs [15.640342726041732]
Attacks on large language models (LLMs) in jailbreaking scenarios raise many security and ethical issues.<n>Current jailbreak attack methods face problems such as low efficiency, high computational cost, and poor cross-model adaptability.<n>Our work proposes an Adversarial Prompt Distillation, which combines masked language modeling, reinforcement learning, and dynamic temperature control.
arXiv Detail & Related papers (2025-05-26T08:27:51Z)
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data.<n>Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs.<n>We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL) We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models [61.916827858666906]
Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries. This paper proposes a method called Token Highlighter to inspect and mitigate the potential jailbreak threats in the user query.
arXiv Detail & Related papers (2024-12-24T05:10:02Z)
Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM [53.79753074854936]
Large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks. This vulnerability poses significant risks to real-world applications. We propose a novel defensive paradigm called GuidelineLLM.
arXiv Detail & Related papers (2024-12-10T12:42:33Z)
Multi-round jailbreak attack on large language models [2.540971544359496]
We introduce a multi-round jailbreak approach to better understand "jailbreak" attacks. This method can rewrite the dangerous prompts, decomposing them into a series of less harmful sub-questions. Our experimental results show a 94% success rate on the llama2-7B.
arXiv Detail & Related papers (2024-10-15T12:08:14Z)
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
Large Language Models (LLMs) have demonstrated great potential as generalist assistants. It is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts. In this paper, we observe that directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs.
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
QROA: A Black-Box Query-Response Optimization Attack on LLMs [2.7624021966289605]
Large Language Models (LLMs) have surged in popularity in recent months, yet they possess capabilities for generating harmful content when manipulated. This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction.
arXiv Detail & Related papers (2024-06-04T07:27:36Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts. We propose a novel defense method termed textbfLayer-specific textbfEditing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
Don't Say No: Jailbreaking LLM by Suppressing Refusal [13.666830169722576]
In this study, we first uncover the reason why vanilla target loss is not optimal, then we explore and enhance the loss objective and introduce the DSN (Don't Say No) attack. The existing evaluation such as refusal keyword matching reveals numerous false positive and false negative instances. To overcome this challenge, we propose an Ensemble Evaluation pipeline that novelly incorporates Natural Language Inference (NLI) contradiction assessment and two external LLM evaluators.
arXiv Detail & Related papers (2024-04-25T07:15:23Z)
Rethinking Jailbreaking through the Lens of Representation Engineering [45.70565305714579]
The recent surge in jailbreaking methods has revealed the vulnerability of Large Language Models (LLMs) to malicious inputs. This study investigates the vulnerability of safety-aligned LLMs by uncovering specific activity patterns.
arXiv Detail & Related papers (2024-01-12T00:50:04Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. Recent works have proposed algorithms to detect LLM-generated text and protect LLMs. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.