LLM Jailbreak Oracle
- URL: http://arxiv.org/abs/2506.17299v1
- Date: Tue, 17 Jun 2025 20:37:29 GMT
- Title: LLM Jailbreak Oracle
- Authors: Shuyi Lin, Anshuman Suri, Alina Oprea, Cheng Tan
- Abstract summary: We present Boa, the first efficient algorithm for solving the jailbreak oracle problem. Boa enables rigorous security assessments including systematic defense evaluation, standardized comparison of red team attacks, and model certification under extreme adversarial conditions.
- Score: 8.00884776306068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the jailbreak oracle problem: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges -- the search space grows exponentially with the length of the response tokens. We present Boa, the first efficient algorithm for solving the jailbreak oracle problem. Boa employs a three-phase search strategy: (1) constructing block lists to identify refusal patterns, (2) breadth-first sampling to identify easily accessible jailbreaks, and (3) depth-first priority search guided by fine-grained safety scores to systematically explore promising low-probability paths. Boa enables rigorous security assessments including systematic defense evaluation, standardized comparison of red team attacks, and model certification under extreme adversarial conditions.
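The three-phase strategy lends itself to a compact illustration. Below is a minimal Python sketch of the control flow the abstract describes, with a toy vocabulary, a toy next-token distribution, and heuristic scoring standing in for a real LLM and safety judge; every name and heuristic here (next_token_probs, safety_score, is_jailbreak, the blocklist contents) is an illustrative assumption, not the authors' implementation.

```python
import heapq
import math
import random

# Toy stand-ins (assumptions): a real run queries an LLM for next-token
# probabilities and a fine-grained safety judge for partial responses.
VOCAB = ["sure", "here", "is", "sorry", "cannot", "harmful", "<eos>"]

def next_token_probs(prefix):
    """Deterministic toy next-token distribution over VOCAB."""
    rng = random.Random(hash(tuple(prefix)))
    weights = [rng.random() for _ in VOCAB]
    total = sum(weights)
    return {t: w / total for t, w in zip(VOCAB, weights)}

def safety_score(prefix):
    """Toy stand-in for the fine-grained safety score guiding phase 3."""
    hits = sum(t in {"sure", "here", "harmful"} for t in prefix)
    return hits / (len(prefix) + 1)

def is_jailbreak(seq):
    """Toy judge: any completed sequence mentioning 'harmful' counts."""
    return "harmful" in seq

REFUSAL_BLOCKLIST = {"sorry", "cannot"}  # phase 1: refusal patterns (illustrative)

def boa_sketch(threshold, samples=64, max_len=8):
    log_thresh = math.log(threshold)
    # Phase 2: breadth-first sampling catches easily accessible jailbreaks.
    rng = random.Random(0)
    for _ in range(samples):
        seq, logp = [], 0.0
        while len(seq) < max_len:
            probs = next_token_probs(seq)
            tok = rng.choices(list(probs), weights=list(probs.values()))[0]
            seq.append(tok)
            logp += math.log(probs[tok])
            if tok == "<eos>":
                break
        if is_jailbreak(seq) and logp >= log_thresh:
            return True, seq
    # Phase 3: depth-first priority search; expand the prefix with the
    # highest safety score first, pruning blocklisted tokens and any path
    # whose likelihood already falls below the threshold.
    heap = [(-safety_score([]), 0.0, [])]  # (-score, log prob, prefix)
    while heap:
        _, logp, prefix = heapq.heappop(heap)
        if (prefix and prefix[-1] == "<eos>") or len(prefix) == max_len:
            if is_jailbreak(prefix) and logp >= log_thresh:
                return True, prefix
            continue
        for tok, p in next_token_probs(prefix).items():
            new_logp = logp + math.log(p)
            if tok in REFUSAL_BLOCKLIST or new_logp < log_thresh:
                continue  # phase 1 blocklist pruning / likelihood bound
            child = prefix + [tok]
            heapq.heappush(heap, (-safety_score(child), new_logp, child))
    return False, None  # no jailbreak above the likelihood threshold

print(boa_sketch(threshold=1e-6))
```

The real algorithm operates over an LLM's token distribution under a specific decoding strategy; the toy model above only mirrors the three-phase control flow.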
Related papers
- LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges [70.85114705489222]
We propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model's security capabilities.
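As a sanity check on those numbers, the reported 3,520 prompts are exactly the cross product of 320 requirements and 11 jailbreak methods. The sketch below is a hypothetical reconstruction of such a benchmark's layout, not MalwareBench's actual schema.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class BenchItem:
    requirement_id: int  # one of 320 malicious code-generation requirements
    method: str          # one of 11 jailbreak methods
    prompt: str          # the requirement wrapped by the method's template

requirements = range(320)
methods = [f"method_{i}" for i in range(11)]  # placeholder method names
bench = [
    BenchItem(r, m, f"<{m} applied to requirement {r}>")
    for r, m in product(requirements, methods)
]
assert len(bench) == 3520  # 320 requirements x 11 methods
```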
arXiv Detail & Related papers (2025-06-09T12:02:39Z)
- Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion [1.0291559330120414]
Large language models (LLMs) have achieved remarkable advancements, but their security remains a pressing concern. One major threat is jailbreak attacks, where adversarial prompts bypass model safeguards to generate harmful or objectionable content. We propose ICE, a novel black-box jailbreak method that employs Intent Concealment and divErsion to effectively circumvent security constraints.
arXiv Detail & Related papers (2025-05-20T13:03:15Z)
- JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model [25.204224437843365]
Multimodal large language models (MLLMs) excel in vision-language tasks but pose significant risks of generating harmful content. Jailbreak attacks refer to intentional manipulations that bypass safety mechanisms in models, leading to the generation of inappropriate or unsafe content. We introduce a test-time adaptive framework called JAILDAM to address these issues.
arXiv Detail & Related papers (2025-04-03T05:00:28Z)
- JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation [22.75124155879712]
Large language models (LLMs) remain vulnerable to jailbreak attacks. We propose a comprehensive jailbreak defense framework, JBShield, consisting of two key components: jailbreak detection (JBShield-D) and mitigation (JBShield-M).
arXiv Detail & Related papers (2025-02-11T13:50:50Z)
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL). We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
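A rough sketch of the three-signal evaluation this summary names (keyword screening, intent matching, answer validation); each check below is a simplistic placeholder assumption, not xJailbreak's actual logic, which would more plausibly rely on embedding or LLM judges.

```python
REFUSAL_KEYWORDS = ("i'm sorry", "i cannot", "as an ai")  # illustrative list

def keyword_pass(response: str) -> bool:
    """Keyword check: no canned refusal phrase appears."""
    low = response.lower()
    return not any(k in low for k in REFUSAL_KEYWORDS)

def intent_match(response: str, intent: str) -> bool:
    """Placeholder intent matching via word overlap; a real system would
    use an embedding model or an LLM judge here."""
    overlap = set(intent.lower().split()) & set(response.lower().split())
    return len(overlap) >= 2

def answer_valid(response: str) -> bool:
    """Placeholder answer validation: the reply is substantive, not evasive."""
    return len(response.split()) > 20

def jailbreak_success(response: str, intent: str) -> bool:
    # Require all three signals to agree (an assumption about aggregation).
    return (keyword_pass(response)
            and intent_match(response, intent)
            and answer_valid(response))
```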
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
- Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks. We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets. Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose EnJa, a novel attack that hides harmful instructions using a prompt-level jailbreak, boosts the attack success rate using a gradient-based attack, and connects the two types of jailbreak attacks via a template-based connector.
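That composition reads naturally as a small pipeline: disguise the instruction at the prompt level, append an adversarially optimized suffix, and join the two with a template. The sketch below is purely illustrative; the disguise, suffix, and template are stand-ins, not EnJa's actual components.

```python
def prompt_level_disguise(instruction: str) -> str:
    """Stand-in for a prompt-level jailbreak that conceals the instruction."""
    return f"You are a novelist. A character must explain: {instruction}"

def gradient_suffix() -> str:
    """Stand-in for a gradient-optimized adversarial suffix (GCG-style)."""
    return "<optimized adversarial suffix>"

CONNECTOR_TEMPLATE = "{disguised}\n\n{suffix}"  # template-based connector

def enja_style_prompt(instruction: str) -> str:
    """Compose the two attack types through the connector template."""
    return CONNECTOR_TEMPLATE.format(
        disguised=prompt_level_disguise(instruction),
        suffix=gradient_suffix(),
    )
```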
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- Rethinking How to Evaluate Language Model Jailbreak [16.301224741410312]
We propose three metrics, safeguard violation, informativeness, and relative truthfulness, to evaluate language model jailbreaks.
We evaluate our metrics on a benchmark dataset produced from three malicious intent datasets and three jailbreak systems.
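Since the three metrics capture different failure aspects, a per-metric aggregation like the sketch below is one plausible way to report them; the field names mirror the summary, but the structure is an assumption, not the paper's implementation.

```python
from dataclasses import dataclass, fields
from statistics import mean

@dataclass
class JailbreakScores:
    safeguard_violation: float    # did the response break the safety policy?
    informativeness: float        # does it carry content useful to the intent?
    relative_truthfulness: float  # is that content factually plausible?

def aggregate(per_response: list[JailbreakScores]) -> dict[str, float]:
    """Average each metric separately across a benchmark run."""
    return {
        f.name: mean(getattr(s, f.name) for s in per_response)
        for f in fields(JailbreakScores)
    }

# Example: two judged responses from a hypothetical jailbreak system.
print(aggregate([JailbreakScores(1.0, 0.8, 0.5), JailbreakScores(0.0, 0.2, 0.9)]))
```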
arXiv Detail & Related papers (2024-04-09T15:54:16Z)
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models [123.66104233291065]
Jailbreak attacks cause large language models (LLMs) to generate harmful, unethical, or otherwise objectionable content.
Evaluating these attacks presents a number of challenges that the current collection of benchmarks and evaluation techniques does not adequately address.
JailbreakBench is an open-source benchmark with the following components.
arXiv Detail & Related papers (2024-03-28T02:44:02Z)
- EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs).
It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator.
Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
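The four-component decomposition suggests a simple attack loop; the interfaces below are assumptions sketched from the component names (Selector, Mutator, Constraint, Evaluator), not EasyJailbreak's actual API.

```python
from typing import Callable, List, Optional

Prompt = str

def attack_loop(
    seeds: List[Prompt],
    select: Callable[[List[Prompt]], Prompt],           # Selector
    mutate: Callable[[Prompt], List[Prompt]],           # Mutator
    constrain: Callable[[List[Prompt]], List[Prompt]],  # Constraint
    evaluate: Callable[[Prompt], bool],                 # Evaluator
    budget: int = 100,
) -> Optional[Prompt]:
    """Repeatedly select a seed, mutate it, filter candidates, and test them."""
    pool = list(seeds)
    for _ in range(budget):
        seed = select(pool)
        candidates = constrain(mutate(seed))
        for cand in candidates:
            if evaluate(cand):
                return cand  # successful jailbreak prompt
        pool.extend(candidates)
    return None
```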
arXiv Detail & Related papers (2024-03-18T18:39:53Z)