One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs
- URL: http://arxiv.org/abs/2505.17598v1
- Date: Fri, 23 May 2025 08:02:38 GMT
- Title: One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs
- Authors: Linbao Li, Yannan Liu, Daojing He, Yu Li
- Abstract summary: ArrAttack is an attack method designed to target defended large language models (LLMs). ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts.
- Score: 13.54228868302755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safety alignment in large language models (LLMs) is increasingly compromised by jailbreak attacks, which can manipulate these models to generate harmful or unintended content. Investigating these attacks is crucial for uncovering model vulnerabilities. However, many existing jailbreak strategies fail to keep pace with the rapid development of defense mechanisms, such as defensive suffixes, rendering them ineffective against defended models. To tackle this issue, we introduce a novel attack method called ArrAttack, specifically designed to target defended LLMs. ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. This capability is supported by a universal robustness judgment model that, once trained, can perform robustness evaluation for any target model with a wide variety of defenses. By leveraging this model, we can rapidly develop a robust jailbreak prompt generator that efficiently converts malicious input prompts into effective attacks. Extensive evaluations reveal that ArrAttack significantly outperforms existing attack strategies, demonstrating strong transferability across both white-box and black-box models, including GPT-4 and Claude-3. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts. We make the codebase available at https://github.com/LLBao/ArrAttack.
Related papers
- Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
- JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs [11.924542310342282]
We present JailPO, a novel black-box jailbreak framework to examine Large Language Models (LLMs) alignment. For scalability and universality, JailPO meticulously trains attack models to automatically generate covert jailbreak prompts. We also introduce a preference optimization-based attack method to enhance the jailbreak effectiveness.
arXiv Detail & Related papers (2024-12-20T07:29:10Z)
- FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks [7.31505609352525]
Defense in large language models (LLMs) is crucial to counter the numerous attackers exploiting these systems to generate harmful content. We propose a moving target defense approach that alters decoding hyperparameters to enhance model robustness. Our results demonstrate that our defense is the most effective against jailbreak attacks in three of the models tested.
arXiv Detail & Related papers (2024-12-10T17:02:28Z)
- Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment [97.38766396447369]
Despite training-time safety alignment, Multimodal Large Language Models (MLLMs) remain vulnerable to jailbreak attacks. We propose Immune, an inference-time defense framework that leverages a safe reward model through controlled decoding to defend against jailbreak attacks.
arXiv Detail & Related papers (2024-11-27T19:00:10Z)
- Rapid Response: Mitigating LLM Jailbreaks with a Few Examples [13.841146655178585]
We develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks.
We evaluate five rapid response methods, all of which use jailbreak proliferation.
Our strongest method reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set.
arXiv Detail & Related papers (2024-11-12T02:44:49Z)
- Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks [27.11523234556414]
We propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG).
PG guides the model to identify harmful prompts by directly setting the first few tokens of the model's output.
We demonstrate the effectiveness of PG across three models and five attack methods.
arXiv Detail & Related papers (2024-08-15T14:51:32Z)
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend. We empirically validate using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models.
arXiv Detail & Related papers (2024-06-08T15:45:31Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing research.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models [55.748851471119906]
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks.
Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters.
We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types.
arXiv Detail & Related papers (2023-10-23T17:46:07Z)