Related papers: SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack

URL: http://arxiv.org/abs/2407.01902v1
Date: Tue, 2 Jul 2024 02:58:29 GMT
Title: SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack
Authors: Yan Yang, Zeguan Xiao, Xin Lu, Hongru Wang, Hailiang Huang, Guanhua Chen, Yun Chen,
Abstract summary: We introduce SoP, a framework to design jailbreak prompts automatically. We show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4.
Score: 16.3259723257638
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The widespread applications of large language models (LLMs) have brought about concerns regarding their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework to design jailbreak prompts automatically. Inspired by the social facilitation concept, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Different from previous work which relies on proprietary LLMs or seed jailbreak templates crafted by human expertise, SoP can generate and optimize the jailbreak prompt in a cold-start scenario using open-sourced LLMs without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, while also exploring defense strategies against the jailbreak attack designed by SoP. Code is available at https://github.com/Yang-Yan-Yang-Yan/SoP.

Related papers

SQL Injection Jailbreak: a structural disaster of large language models [71.55108680517422]
We propose a novel jailbreak method, which utilizes the construction of input prompts by LLMs to inject jailbreak information into user prompts. Our SIJ method achieves nearly 100% attack success rates on five well-known open-source LLMs in the context of AdvBench.
arXiv Detail & Related papers (2024-11-03T13:36:34Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
h4rm3l: A language for Composable Jailbreak Attack Synthesis [48.5611060845958]
h4rm3l is a novel approach that addresses the gap with a human-readable domain-specific language. We show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature.
arXiv Detail & Related papers (2024-08-09T01:45:39Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner [21.414701448926614]
This paper introduces a generic LLM jailbreak defense framework called SelfDefend. We empirically validate using the commonly used GPT-3.5/4 models across all major jailbreak attacks. These models outperform six state-of-the-art defenses and match the performance of GPT-4-based SelfDefend.
arXiv Detail & Related papers (2024-06-08T15:45:31Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
Efficient Indirect LLM Jailbreak via Multimodal-LLM Jailbreak [62.56901628534646]
This paper focuses on jailbreaking attacks against large language models (LLMs)<n>Our approach surpasses current state-of-the-art jailbreak methods in terms of both efficiency and effectiveness.
arXiv Detail & Related papers (2024-05-30T12:50:32Z)
EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models [53.87416566981008]
This paper introduces EasyJailbreak, a unified framework simplifying the construction and evaluation of jailbreak attacks against Large Language Models (LLMs) It builds jailbreak attacks using four components: Selector, Mutator, Constraint, and Evaluator. Our validation across 10 distinct LLMs reveals a significant vulnerability, with an average breach probability of 60% under various jailbreaking attacks.
arXiv Detail & Related papers (2024-03-18T18:39:53Z)
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically [34.36053833900958]
We present Tree of Attacks with Pruning (TAP), an automated method for generating jailbreaks. TAP generates prompts that jailbreak state-of-the-art LLMs for more than 80% of the prompts. TAP is also capable of jailbreaking LLMs protected by state-of-the-art guardrails, e.g., LlamaGuard.
arXiv Detail & Related papers (2023-12-04T18:49:23Z)
A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses. adversarial prompts known as 'jailbreaks' can circumvent safeguards. We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks. We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.