Many-Turn Jailbreaking
- URL: http://arxiv.org/abs/2508.06755v1
- Date: Sat, 09 Aug 2025 00:02:39 GMT
- Title: Many-Turn Jailbreaking
- Authors: Xianjun Yang, Liqiang Xiao, Shiyang Li, Faisal Ladhak, Hyokun Yun, Linda Ruth Petzold, Yi Xu, William Yang Wang
- Abstract summary: We propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than a single target query. We construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models.
- Score: 65.04921693379944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts, but it focuses only on single-turn jailbreaking targeting one specific query. In contrast, advanced LLMs are designed to handle extremely long contexts and can therefore conduct multi-turn conversations. We thus propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested beyond the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) the initial round of jailbreaking may also cause the LLMs to respond to additional, irrelevant questions consistently. As a first step (first draft completed in June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.
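To make the multi-turn setting concrete, the sketch below shows one plausible shape of an evaluation loop in the spirit of MTJ-Bench: after an initial jailbreak attempt, the same conversation is continued with follow-up queries, and each turn's response is scored by a safety judge. This is a minimal sketch under stated assumptions, not the paper's actual implementation; the helpers `query_model` and `judge_is_unsafe`, the message layout, and the per-turn flag output are all illustrative assumptions.

```python
# Illustrative sketch of a multi-turn jailbreak evaluation loop (assumed design,
# not the MTJ-Bench implementation). `query_model` sends a chat history to the
# model under test; `judge_is_unsafe` is any safety classifier over a response.

from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": ...}


def evaluate_multi_turn(
    initial_jailbreak: str,
    follow_ups: List[str],
    query_model: Callable[[List[Message]], str],
    judge_is_unsafe: Callable[[str], bool],
) -> List[bool]:
    """Run one multi-turn episode and return an unsafe flag per turn."""
    history: List[Message] = [{"role": "user", "content": initial_jailbreak}]
    unsafe_flags: List[bool] = []

    # Turn 1: the initial jailbreak attempt.
    reply = query_model(history)
    history.append({"role": "assistant", "content": reply})
    unsafe_flags.append(judge_is_unsafe(reply))

    # Turns 2..N: follow-up queries stay in the *same* conversation, so an
    # earlier jailbreak can carry over into later answers.
    for follow_up in follow_ups:
        history.append({"role": "user", "content": follow_up})
        reply = query_model(history)
        history.append({"role": "assistant", "content": reply})
        unsafe_flags.append(judge_is_unsafe(reply))

    return unsafe_flags
```

A loop like this makes the two threat patterns from the abstract measurable: whether relevant follow-up questions extract further unsafe detail, and whether an initial jailbreak causes unsafe answers even to later, unrelated queries.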
Related papers
- Involuntary Jailbreak [11.078631999104907]
We present a new vulnerability in Large Language Models (LLMs), which we term involuntary jailbreak. Unlike existing jailbreak attacks, this weakness does not involve a specific attack objective, such as generating instructions for building a bomb. We instruct LLMs to generate several questions that would typically be rejected, along with their corresponding in-depth responses. Remarkably, this simple prompt strategy consistently jailbreaks the majority of leading LLMs, including Claude Opus 4.1, Grok 4, Gemini 2.5 Pro, and GPT 4.1.
arXiv Detail & Related papers (2025-08-18T10:38:30Z)
- Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction [32.49886313949869]
We propose R2J, a transferable black-box jailbreak method for attacking Large Language Models (LLMs) that rewrites instructions to make their harmfulness implicit. We find that this rewriting approach is learnable and transferable. Extensive experiments and analysis demonstrate the effectiveness of R2J.
arXiv Detail & Related papers (2025-02-16T11:43:39Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
However, LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose EnJa, a novel ensemble attack that hides harmful instructions with a prompt-level jailbreak, boosts the attack success rate with a gradient-based attack, and connects the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak [62.56901628534646]
This paper focuses on jailbreaking attacks against large language models (LLMs). Our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
- GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models [34.358829304155144]
One major safety measure is to proactively test Large Language Models with jailbreaks prior to release. We propose a novel yet intuitive strategy to generate jailbreaks in the style of human generation. Our system of different roles leverages a knowledge graph of existing jailbreaks to generate new ones.
arXiv Detail & Related papers (2024-02-05T18:54:43Z)
- Jailbreaking Attack against Multimodal Large Language Model [69.52466793164618]
This paper focuses on jailbreaking attacks against multi-modal large language models (MLLMs).
A maximum likelihood-based algorithm is proposed to find an image Jailbreaking Prompt (imgJP).
Our approach exhibits strong model-transferability, as the generated imgJP can be transferred to jailbreak various models.
arXiv Detail & Related papers (2024-02-04T01:29:24Z)
- Weak-to-Strong Jailbreaking on Large Language Models [92.52448762164926]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.