Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit
Clues
- URL: http://arxiv.org/abs/2402.09091v2
- Date: Fri, 16 Feb 2024 10:24:04 GMT
- Authors: Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, Yang Liu
- Abstract summary: We propose an indirect jailbreak attack approach, Puzzler, which can bypass the LLM's defense strategy and obtain malicious responses.
Our experiments show that Puzzler achieves a query success rate of 96.6% on closed-source LLMs.
When tested against state-of-the-art jailbreak detection approaches, Puzzler proves more effective at evading detection than the baselines.
- Score: 16.97760778679782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the development of LLMs, the security threats of LLMs are getting more
and more attention. Numerous jailbreak attacks have been proposed to assess the
security defense of LLMs. Current jailbreak attacks primarily utilize scenario
camouflage techniques. However, their explicit mention of malicious intent
is easily recognized and defended against by LLMs. In this paper, we propose an
indirect jailbreak attack approach, Puzzler, which can bypass the LLM's defense
strategy and obtain malicious responses by implicitly providing LLMs with some
clues about the original malicious query. In addition, inspired by the wisdom
of "When unable to attack, defend" from Sun Tzu's Art of War, we adopt a
defensive stance to gather clues about the original malicious query through
LLMs. Extensive experimental results show that Puzzler achieves a query success
rate of 96.6% on closed-source LLMs, which is 57.9%-82.7% higher than
baselines. Furthermore, when tested against the state-of-the-art jailbreak
detection approaches, Puzzler proves to be more effective at evading detection
than the baselines.
Related papers
- Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing [14.094372002702476]
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications.
Recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts.
We propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks.
arXiv Detail & Related papers (2024-05-28T13:26:12Z)
- Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions [125.21418304558948]
Prompt leakage in large language models (LLMs) poses a significant security and privacy threat.
Prompt leakage in multi-turn LLM interactions, along with mitigation strategies, has not been studied in a standardized manner.
This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs.
arXiv Detail & Related papers (2024-04-24T23:39:58Z)
- Tastle: Distract Large Language Models for Automatic Jailbreak Attack [9.137714258654842]
We propose a black-box jailbreak framework for automated red teaming of large language models (LLMs).
Our framework is superior in terms of effectiveness, scalability and transferability.
We also evaluate the effectiveness of existing jailbreak defense methods against our attack.
arXiv Detail & Related papers (2024-03-13T11:16:43Z)
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they have demonstrated capabilities to generate harmful content when manipulated.
We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting.
Our attack achieves an 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z)
- Comprehensive Assessment of Jailbreak Attacks Against LLMs [28.58973312098698]
We study 13 cutting-edge jailbreak methods from four categories, 160 questions from 16 violation categories, and six popular LLMs.
Our experimental results demonstrate that the optimized jailbreak prompts consistently achieve the highest attack success rates.
We discuss the trade-off between attack performance and efficiency, and show that the jailbreak prompts remain transferable.
arXiv Detail & Related papers (2024-02-08T13:42:50Z)
- A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily [51.63085197162279]
Large Language Models (LLMs) are designed to provide useful and safe responses.
However, adversarial prompts known as 'jailbreaks' can circumvent safeguards.
We propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts.
arXiv Detail & Related papers (2023-11-14T16:02:16Z)
- Jailbreaking Black Box Large Language Models in Twenty Queries [97.29563503097995]
Large language models (LLMs) are vulnerable to adversarial jailbreaks.
We propose an algorithm that generates semantic jailbreaks with only black-box access to an LLM.
arXiv Detail & Related papers (2023-10-12T15:38:28Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)