Related papers: Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

URL: http://arxiv.org/abs/2406.09289v1
Date: Thu, 13 Jun 2024 16:26:47 GMT
Title: Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Authors: Sarah Ball, Frauke Kreuter, Nina Rimsky,
Abstract summary: This paper analyses model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other classes.
Score: 4.547063832007314
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conversational Large Language Models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. To better understand how different jailbreak types circumvent safeguards, this paper analyses model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other classes. This may indicate that different kinds of effective jailbreaks operate via similar internal mechanisms. We investigate a potential common mechanism of harmfulness feature suppression, and provide evidence for its existence by looking at the harmfulness vector component. These findings offer actionable insights for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.

Related papers

Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.<n>Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift [10.737151905158926]
We show how to use continuous learning to detect jailbreaks and adapt rapidly to new emerging jailbreaks. We introduce an unsupervised active monitoring approach to identify novel jailbreaks.
arXiv Detail & Related papers (2025-04-28T03:01:51Z)
The Jailbreak Tax: How Useful are Your Jailbreak Outputs? [21.453837660747844]
We ask whether model outputs produced by existing jailbreaks are actually useful. Our evaluation of eight representative jailbreaks reveals a consistent drop in model utility in jailbroken responses. Overall, our work proposes the jailbreak tax as a new important metric in AI safety.
arXiv Detail & Related papers (2025-04-14T20:30:41Z)
Probabilistic Modeling of Jailbreak on Multimodal LLMs: From Quantification to Application [3.514716436491414]
We introduce jailbreak probability to quantify the jailbreak potential of an input, which represents the likelihood that MLLMs generated a malicious response when prompted with this input.<n>Specifically, we propose Jailbreak-Probability-based Attack (JPA) that optimize adversarial perturbations on input image to maximize jailbreak probability.<n>To counteract attacks, we also propose Jailbreak-Probability-based Finetuning (JPF), which minimizes jailbreak probability through MLLM parameter updates.
arXiv Detail & Related papers (2025-03-10T07:10:38Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit [21.380057443286034]
Large language models (LLMs) are vulnerable to jailbreak attacks. Jailbreak attacks are prevalent, but the understanding of their underlying mechanisms remains limited.
arXiv Detail & Related papers (2024-11-17T16:08:34Z)
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples [13.841146655178585]
We develop rapid response techniques to look to block whole classes of jailbreaks after observing only a handful of attacks. We evaluate five rapid response methods, all of which use jailbreak proliferation. Our strongest method reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set.
arXiv Detail & Related papers (2024-11-12T02:44:49Z)
SQL Injection Jailbreak: a structural disaster of large language models [71.55108680517422]
We propose a novel jailbreak method, which utilizes the construction of input prompts by LLMs to inject jailbreak information into user prompts. Our SIJ method achieves nearly 100% attack success rates on five well-known open-source LLMs in the context of AdvBench.
arXiv Detail & Related papers (2024-11-03T13:36:34Z)
What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks [3.0700566896646047]
We show that different jailbreaking methods work via different nonlinear features in prompts. These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on.
arXiv Detail & Related papers (2024-11-02T17:29:47Z)
IDEATOR: Jailbreaking Large Vision-Language Models Using Themselves [67.30731020715496]
We propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. It achieves a 94% success rate in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high success rates of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Meta's Chameleon.
arXiv Detail & Related papers (2024-10-29T07:15:56Z)
BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger [67.75420257197186]
In this work, we propose $textbfBaThe, a simple yet effective jailbreak defense mechanism. Jailbreak backdoor attack uses harmful instructions combined with manually crafted strings as triggers to make the backdoored model generate prohibited responses. We assume that harmful instructions can function as triggers, and if we alternatively set rejection responses as the triggered response, the backdoored model then can defend against jailbreak attacks.
arXiv Detail & Related papers (2024-08-17T04:43:26Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
Knowledge-to-Jailbreak: Investigating Knowledge-driven Jailbreaking Attacks for Large Language Models [86.6931690001357]
knowledge-to-jailbreak aims to generate jailbreaking attacks from domain knowledge.<n>We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs.<n>Experiments show that jailbreak-generator can generate jailbreaks comparable in harmfulness to those crafted by human experts.
arXiv Detail & Related papers (2024-06-17T15:59:59Z)
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models [21.854909839996612]
Jailbreak attacks aim to induce Large Language Models (LLMs) to generate harmful responses for forbidden instructions. There is (surprisingly) no consensus on how to evaluate whether a jailbreak attempt is successful. JailbreakEval is a user-friendly toolkit focusing on the evaluation of jailbreak attempts.
arXiv Detail & Related papers (2024-06-13T16:59:43Z)
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content. evaluating these attacks presents a number of challenges, which the current collection of benchmarks and evaluation techniques do not adequately address. JailbreakBench is an open-sourced benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs) It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z)
A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance. It scores the harmfulness of a victim model's responses to forbidden prompts. It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.