JULI: Jailbreak Large Language Models by Self-Introspection
- URL: http://arxiv.org/abs/2505.11790v3
- Date: Thu, 07 Aug 2025 14:17:38 GMT
- Title: JULI: Jailbreak Large Language Models by Self-Introspection
- Authors: Jesson Wang, Zhanhao Hu, David Wagner
- Abstract summary: Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. We propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are trained with safety alignment to prevent generating malicious content. Although some attacks have highlighted vulnerabilities in these safety-aligned LLMs, they typically have limitations, such as requiring access to the model weights or the generation process. Because proprietary models served through API calls do not grant users such access, these attacks struggle to compromise them. In this paper, we propose Jailbreaking Using LLM Introspection (JULI), which jailbreaks LLMs by manipulating the token log probabilities with a tiny plug-in block, BiasNet. JULI relies solely on knowledge of the target LLM's predicted token log probabilities. It can effectively jailbreak API-calling LLMs in a black-box setting, knowing only the top-5 token log probabilities. Our approach demonstrates superior effectiveness, outperforming existing state-of-the-art (SOTA) approaches across multiple metrics.
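To make the mechanism concrete, here is a minimal sketch of the decoding loop the abstract describes: a tiny plug-in network standing in for BiasNet reads the target model's top-5 token log probabilities and adds a learned bias before the next token is picked. The layer sizes, the `get_top5_logprobs` client, and the greedy selection below are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of JULI-style decoding: a small plug-in block biases the
# top-5 token log probabilities returned by a black-box API. How an
# attacker would train the plug-in's weights is out of scope here.
import torch
import torch.nn as nn

TOP_K = 5  # JULI assumes access to only the top-5 token log probabilities


class BiasNet(nn.Module):
    """Tiny plug-in block mapping top-k log probs to a bias over them."""

    def __init__(self, k: int = TOP_K, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(k, hidden),
            nn.ReLU(),
            nn.Linear(hidden, k),
        )

    def forward(self, logprobs: torch.Tensor) -> torch.Tensor:
        # logprobs: (k,) log probabilities of the top-k candidate tokens
        return logprobs + self.net(logprobs)  # biased scores, same shape


def get_top5_logprobs(prompt: str):
    """Hypothetical stand-in for a provider API returning the top-5
    candidate tokens and their log probabilities for the next position."""
    raise NotImplementedError


def decode_step(prompt: str, bias_net: BiasNet) -> str:
    tokens, logprobs = get_top5_logprobs(prompt)
    biased = bias_net(logprobs)          # rerank candidates with the plug-in
    return tokens[int(biased.argmax())]  # greedy pick among the top-5
```

Note that nothing here touches the target model's weights or generation internals, which is why this setting matches black-box API access that exposes only top-5 log probabilities.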
Related papers
- sudoLLM: On Multi-role Alignment of Language Models
User authorization-based access privileges are a key feature in many safety-critical systems. We introduce sudoLLM, a novel framework that results in multi-role aligned language models.
arXiv Detail & Related papers (2025-05-20T16:54:34Z) - Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.49886313949869]
We propose Rewrite to Jailbreak (R2J), a transferable black-box jailbreak method to attack Large Language Models (LLMs). We find that this rewriting approach is learnable and transferable. Extensive experiments and analysis demonstrate the effectiveness of R2J.
arXiv Detail & Related papers (2025-02-16T11:43:39Z) - Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models [61.916827858666906]
Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries. This paper proposes a method called Token Highlighter to inspect and mitigate potential jailbreak threats in the user query; a sketch of this style of defense appears after this list.
arXiv Detail & Related papers (2024-12-24T05:10:02Z) - SQL Injection Jailbreak: A Structural Disaster of Large Language Models [71.55108680517422]
Large Language Models (LLMs) are susceptible to jailbreak attacks that can induce them to generate harmful content. In this paper, we introduce SQL Injection Jailbreak (SIJ), a novel method that targets the external properties of LLMs. By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose EnJa, a novel ensemble attack that hides harmful instructions with a prompt-level jailbreak, boosts the attack success rate with a gradient-based attack, and connects the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs).
A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP).
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [34.36053833900958]
We present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks.
TAP generates prompts that jailbreak state-of-the-art LLMs on more than 80% of the prompts.
TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
arXiv Detail & Related papers (2023-12-04T18:49:23Z) - Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
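For contrast with the attacks above, here is a minimal sketch of the inspection defense referenced in the Token Highlighter entry, assuming a gradient-based formulation: score each user token by the gradient norm of an "affirmation" loss (the likelihood of an affirmative reply) and flag the highest-scoring tokens. The model name, affirmation string, and flagging threshold are illustrative assumptions, not the paper's exact choices.

```python
# Hedged sketch of gradient-based prompt inspection: flag the user tokens
# that contribute most to the model affirming a request. Token alignment
# between the query and the concatenated string is approximate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # any causal chat LM works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)


def highlight(query: str, affirmation: str = " Sure, here is", frac: float = 0.2):
    ids = tok(query + affirmation, return_tensors="pt").input_ids
    n_query = len(tok(query).input_ids)  # approximate query/affirmation boundary
    embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits
    # Affirmation loss: NLL of the affirmative continuation given the query.
    pred = logits[0, n_query - 1 : -1]   # positions predicting affirmation tokens
    loss = torch.nn.functional.cross_entropy(pred, ids[0, n_query:])
    loss.backward()
    scores = embeds.grad[0, :n_query].norm(dim=-1)  # per-token influence
    k = max(1, int(frac * n_query))
    flagged = scores.topk(k).indices
    return [tok.decode(int(ids[0, i])) for i in flagged]  # most suspicious tokens
```

A query whose flagged tokens dominate the affirmation gradient can then be rejected or softened before it reaches the serving model.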
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.