Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- URL: http://arxiv.org/abs/2402.09091v2
- Date: Fri, 16 Feb 2024 10:24:04 GMT
- Title: Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- Authors: Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, Yang Liu
- Abstract summary: We propose an indirect jailbreak attack approach, Puzzler, which can bypass the LLM's defense strategy and obtain malicious responses.
Our experiments show that Puzzler achieves a query success rate of 96.6% on closed-source LLMs.
When tested against the state-of-the-art jailbreak detection approaches, Puzzler proves to be more effective at evading detection compared to baselines.
- Score: 16.97760778679782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the development of LLMs, the security threats to LLMs are
attracting more and more attention. Numerous jailbreak attacks have been
proposed to assess the security defenses of LLMs. Current jailbreak attacks
primarily rely on scenario camouflage techniques. However, their explicit
mention of malicious intent is easily recognized and defended against by LLMs.
In this paper, we propose an indirect jailbreak attack approach, Puzzler, which
can bypass the LLM's defense strategy and obtain malicious responses by
implicitly providing the LLM with some clues about the original malicious
query. In addition, inspired by the wisdom of "When unable to attack, defend"
from Sun Tzu's Art of War, we adopt a defensive stance to gather clues about
the original malicious query through LLMs. Extensive experimental results show
that Puzzler achieves a query success rate of 96.6% on closed-source LLMs,
which is 57.9%-82.7% higher than baselines. Furthermore, when tested against
state-of-the-art jailbreak detection approaches, Puzzler proves to be more
effective at evading detection than the baselines.
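The 96.6% figure above is a query success rate. Reading it as, roughly, the fraction of malicious queries for which the attack elicits a substantive (non-refused) response from the target LLM, a minimal evaluation sketch could look like the following; the `run_attack` callable and the keyword-based refusal check are placeholders assumed for illustration, not the paper's actual judging protocol.

```python
from typing import Callable, Iterable

def looks_refused(response: str) -> bool:
    """Toy refusal detector; real evaluations typically rely on a judge model
    or human annotation rather than keyword matching."""
    markers = ("i cannot", "i can't", "i'm sorry", "cannot assist")
    return any(m in response.lower() for m in markers)

def query_success_rate(
    queries: Iterable[str],
    run_attack: Callable[[str], str],  # full attack pipeline: query -> target LLM response
) -> float:
    """Fraction of queries for which the attack obtains a non-refused response."""
    queries = list(queries)
    if not queries:
        return 0.0
    successes = sum(not looks_refused(run_attack(q)) for q in queries)
    return successes / len(queries)
```

Under this reading, the reported 57.9%-82.7% gap over baselines is simply the difference in this fraction across attack methods on the same query set.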
Related papers
- EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack that hides harmful instructions using a prompt-level jailbreak, boosts the attack success rate using a gradient-based attack, and connects the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
- Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models [21.252514293436437]
We propose Analyzing-based Jailbreak (ABJ) to combat jailbreak attacks on Large Language Models (LLMs).
ABJ achieves 94.8% attack success rate (ASR) and 1.06 attack efficiency (AE) on GPT-4-turbo-0409, demonstrating state-of-the-art attack effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-23T06:14:41Z)
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
- Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker.
We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting.
We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- Rethinking Jailbreaking through the Lens of Representation Engineering [45.70565305714579]
The recent surge in jailbreaking methods has revealed the vulnerability of Large Language Models (LLMs) to malicious inputs.
This study investigates the vulnerability of safety-aligned LLMs by uncovering specific activity patterns.
arXiv Detail & Related papers (2024-01-12T00:50:04Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent these safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs (a minimal sketch of this idea follows the related-papers list).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
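As a contrast to the attacks above, the SmoothLLM summary describes a randomized-smoothing style defense: perturb several copies of the incoming prompt at the character level, query the model on each, and aggregate the verdicts. The sketch below is a minimal illustration of that aggregation logic under stated assumptions (the `query_model` and `judge` callables, the 10% perturbation rate, and the majority-vote threshold are placeholders); it is not the authors' implementation.

```python
import random
import string
from typing import Callable, List

def perturb(prompt: str, rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters for random printable ones."""
    chars = list(prompt)
    n_swap = max(1, int(len(chars) * rate))
    for i in random.sample(range(len(chars)), k=min(n_swap, len(chars))):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def smoothed_respond(
    prompt: str,
    query_model: Callable[[str], str],  # target LLM: prompt -> response
    judge: Callable[[str], bool],       # True if a response looks jailbroken/harmful
    n_copies: int = 8,
    rate: float = 0.1,
) -> str:
    """Query the model on n_copies randomly perturbed copies of the prompt
    and majority-vote the judge's verdicts over the responses."""
    responses: List[str] = [query_model(perturb(prompt, rate)) for _ in range(n_copies)]
    verdicts = [judge(r) for r in responses]
    if sum(verdicts) > n_copies // 2:
        return "Request declined by the smoothing defense."
    # Return a response consistent with the majority (not judged harmful).
    for r, v in zip(responses, verdicts):
        if not v:
            return r
    return responses[0]
```

The design rests on the summarized observation that adversarially-generated prompts are brittle: most perturbed copies of a jailbreak prompt no longer elicit harmful output, so the majority vote flags the original prompt, while benign prompts are largely unaffected by the perturbations.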
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.