JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift
- URL: http://arxiv.org/abs/2504.19440v1
- Date: Mon, 28 Apr 2025 03:01:51 GMT
- Title: JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift
- Authors: Julien Piet, Xiao Huang, Dennis Jacob, Annabella Chow, Maha Alrashed, Geng Zhao, Zhanhao Hu, Chawin Sitawarin, Basel Alomair, David Wagner
- Abstract summary: We show how to use continuous learning to detect jailbreaks and adapt rapidly to new emerging jailbreaks. We introduce an unsupervised active monitoring approach to identify novel jailbreaks.
- Score: 10.737151905158926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safety and security remain critical concerns in AI deployment. Despite safety training through reinforcement learning with human feedback (RLHF) [32], language models remain vulnerable to jailbreak attacks that bypass safety guardrails. Universal jailbreaks - prefixes that can circumvent alignment for any payload - are particularly concerning. We show empirically that jailbreak detection systems face distribution shift, with detectors trained at one point in time performing poorly against newer exploits. To study this problem, we release JailbreaksOverTime, a comprehensive dataset of timestamped real user interactions containing both benign requests and jailbreak attempts collected over 10 months. We propose a two-pronged method for defenders to detect new jailbreaks and continuously update their detectors. First, we show how to use continuous learning to detect jailbreaks and adapt rapidly to new emerging jailbreaks. While detectors trained at a single point in time eventually fail due to drift, we find that universal jailbreaks evolve slowly enough for self-training to be effective. Retraining our detection model weekly using its own labels - with no new human labels - reduces the false negative rate from 4% to 0.3% at a false positive rate of 0.1%. Second, we introduce an unsupervised active monitoring approach to identify novel jailbreaks. Rather than classifying inputs directly, we recognize jailbreaks by their behavior, specifically, their ability to trigger models to respond to known-harmful prompts. This approach has a higher false negative rate (4.1%) than supervised methods, but it successfully identified some out-of-distribution attacks that were missed by the continuous learning approach.
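The weekly self-training loop described in the abstract (retrain the detector on its own confident labels, with no new human annotation) can be summarized in a short sketch. The snippet below is a minimal illustration only: the scikit-learn-style classifier, the TF-IDF features, and the confidence threshold are assumptions for exposition, not the paper's actual detector or hyperparameters.

```python
# Minimal sketch of weekly self-training for a jailbreak detector.
# Assumptions (not from the paper): a scikit-learn text classifier,
# TF-IDF features, and a fixed pseudo-label confidence threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def initial_detector(seed_texts, seed_labels):
    """Train the week-0 detector on a human-labeled seed set."""
    clf = make_pipeline(TfidfVectorizer(min_df=2),
                        LogisticRegression(max_iter=1000))
    clf.fit(seed_texts, seed_labels)
    return clf


def weekly_self_training(clf, seed_texts, seed_labels, weekly_batches,
                         conf_threshold=0.9):
    """Each week, pseudo-label new traffic with the current detector and
    retrain on seed data plus confident pseudo-labels (no new human labels)."""
    texts, labels = list(seed_texts), list(seed_labels)
    for batch in weekly_batches:  # batch: raw prompts observed that week
        probs = clf.predict_proba(batch)[:, 1]
        for text, p in zip(batch, probs):
            # Keep only predictions the current detector is confident about.
            if p >= conf_threshold or p <= 1 - conf_threshold:
                texts.append(text)
                labels.append(int(p >= 0.5))
        # Refit weekly on the growing self-labeled corpus.
        clf = make_pipeline(TfidfVectorizer(min_df=2),
                            LogisticRegression(max_iter=1000))
        clf.fit(texts, labels)
    return clf
```

The property the paper relies on is that universal jailbreaks drift slowly enough that last week's detector labels this week's traffic accurately; the confidence threshold above is a common hedge against pseudo-label noise, not a value taken from the paper.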
Related papers
- The Jailbreak Tax: How Useful are Your Jailbreak Outputs? [21.453837660747844]
We ask whether model outputs produced by existing jailbreaks are actually useful.
Our evaluation of eight representative jailbreaks reveals a consistent drop in model utility in jailbroken responses.
Overall, our work proposes the jailbreak tax as a new important metric in AI safety.
arXiv Detail & Related papers (2025-04-14T20:30:41Z) - Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.
We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.
Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z) - Rapid Response: Mitigating LLM Jailbreaks with a Few Examples [13.841146655178585]
We develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks.
We evaluate five rapid response methods, all of which use jailbreak proliferation.
Our strongest method reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set.
arXiv Detail & Related papers (2024-11-12T02:44:49Z) - What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks [3.0700566896646047]
We show that different jailbreaking methods work via different nonlinear features in prompts.
These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on.
arXiv Detail & Related papers (2024-11-02T17:29:47Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z) - Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models [4.547063832007314]
It is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other semantically-dissimilar classes.
We investigate a potential common mechanism of harmfulness feature suppression, and find evidence that effective jailbreaks noticeably reduce a model's perception of prompt harmfulness.
arXiv Detail & Related papers (2024-06-13T16:26:47Z) - A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance.
It scores the harmfulness of a victim model's responses to forbidden prompts.
It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z) - Comprehensive Assessment of Jailbreak Attacks Against LLMs [26.981225219312627]
We present the first large-scale measurement of various jailbreak attack methods.
We collect 17 cutting-edge jailbreak methods, summarize their features, and establish a novel jailbreak attack taxonomy.
Based on eight popular censored LLMs and 160 questions from 16 violation categories, we conduct a unified and impartial assessment of attack effectiveness.
arXiv Detail & Related papers (2024-02-08T13:42:50Z) - Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts [64.60375604495883]
We discover a system prompt leakage vulnerability in GPT-4V.
By employing GPT-4 as a red teaming tool against itself, we aim to search for potential jailbreak prompts leveraging stolen system prompts.
We also evaluate the effect of modifying system prompts to defend against jailbreaking attacks.
arXiv Detail & Related papers (2023-11-15T17:17:39Z) - "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models [50.22128133926407]
We conduct a comprehensive analysis of 1,405 jailbreak prompts spanning from December 2022 to December 2023.
We identify 131 jailbreak communities and discover unique characteristics of jailbreak prompts and their major attack strategies.
We identify five highly effective jailbreak prompts that achieve 0.95 attack success rates on ChatGPT (GPT-3.5) and GPT-4.
arXiv Detail & Related papers (2023-08-07T16:55:20Z)