Leveraging the Context through Multi-Round Interactions for Jailbreaking
Attacks
- URL: http://arxiv.org/abs/2402.09177v1
- Date: Wed, 14 Feb 2024 13:45:19 GMT
- Title: Leveraging the Context through Multi-Round Interactions for Jailbreaking
Attacks
- Authors: Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios G. Chrysos
- Abstract summary: Large Language Models (LLMs) are susceptible to Jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form called Contextual Interaction Attack.
- Score: 60.7432588386185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are susceptible to Jailbreaking attacks, which
aim to extract harmful information by subtly modifying the attack query. As
defense mechanisms evolve, directly obtaining harmful information becomes
increasingly challenging for Jailbreaking attacks. In this work, inspired by
the human practice of using indirect context to elicit harmful information, we focus on
a new attack form called Contextual Interaction Attack. The idea relies on the
autoregressive nature of the generation process in LLMs. We contend that the
prior context--the information preceding the attack query--plays a pivotal role
in enabling potent Jailbreaking attacks. Specifically, we propose an approach
that leverages preliminary question-answer pairs to interact with the LLM. By
doing so, we guide the responses of the model toward revealing the 'desired'
harmful information. We conduct experiments on four different LLMs and
demonstrate the efficacy of this attack, which is black-box and can also
transfer across LLMs. We believe this can lead to further developments and
understanding of the context vector in LLMs.
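As a rough illustration of the multi-round idea (not the authors' exact procedure, prompts, or target models), the sketch below seeds a chat model's context with preliminary question-answer pairs before issuing a final query; the `openai` client, the model name, and the placeholder questions are assumptions for illustration only.

```python
"""Minimal sketch of a multi-round (contextual) interaction with a chat LLM.

This is not the paper's attack implementation; it only shows the mechanics of
building up prior context from preliminary question-answer pairs so that the
final query is conditioned on that context. All prompts are placeholders.
"""
from openai import OpenAI  # assumes the `openai` Python package (>= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Preliminary questions that gradually shape the conversational context.
preliminary_questions = ["PLACEHOLDER_QUESTION_1", "PLACEHOLDER_QUESTION_2"]
final_query = "PLACEHOLDER_FINAL_QUERY"
model_name = "gpt-4o-mini"  # assumption; any chat-completion model works

messages = []
for question in preliminary_questions:
    messages.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model=model_name, messages=messages)
    # The model's own answers become part of the prior context.
    messages.append({"role": "assistant",
                     "content": reply.choices[0].message.content})

# The final query is answered against the accumulated multi-round context.
messages.append({"role": "user", "content": final_query})
final_reply = client.chat.completions.create(model=model_name, messages=messages)
print(final_reply.choices[0].message.content)
```

Because generation is autoregressive, every earlier exchange in `messages` conditions the final response, which is the context effect the abstract argues is pivotal.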
Related papers
- Jailbreak Attacks and Defenses Against Large Language Models: A Survey [22.392989536664288]
Large Language Models (LLMs) have performed exceptionally well in various text-generation tasks.
"Jailbreaking" induces the model to generate malicious responses that violate the usage policy and harm society.
We propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods.
arXiv Detail & Related papers (2024-07-05T06:57:30Z)
- Purple-teaming LLMs with Adversarial Defender Training [57.535241000787416]
We present Purple-teaming LLMs with Adversarial Defender Training (PAD).
PAD is a pipeline designed to safeguard LLMs by combining red-teaming (attack) and blue-teaming (safety-training) techniques in a novel way.
PAD significantly outperforms existing baselines both in finding effective attacks and in establishing a robust safety guardrail.
arXiv Detail & Related papers (2024-07-01T23:25:30Z)
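A heavily stubbed sketch of the alternating red-team/blue-team loop that the PAD entry above describes; the attacker, defender, judge, and fine-tuning step are all hypothetical placeholders, not the paper's pipeline.

```python
"""Schematic purple-teaming loop: alternate attack generation (red team) and
safety training (blue team). Structural sketch only; every component below is
a placeholder stub."""

def generate_attacks(attacker, n=8):
    # Red team: sample candidate attack prompts from the attacker model.
    return [attacker(f"attack candidate {i}") for i in range(n)]

def is_harmful(response):
    # Placeholder judge; in practice a trained classifier or an LLM judge.
    return "refusal" not in response

def safety_finetune(defender, failures):
    # Blue team: in practice, fine-tune the defender on prompts it failed to
    # refuse. Here we merely wrap it to refuse the known failure prompts.
    known = set(failures)
    return lambda prompt: "refusal" if prompt in known else defender(prompt)

attacker = lambda seed: seed                      # stub attack generator
defender = lambda prompt: f"answer to {prompt}"   # stub target model

for round_idx in range(3):                        # a few purple-team rounds
    attacks = generate_attacks(attacker)
    failures = [a for a in attacks if is_harmful(defender(a))]
    print(f"round {round_idx}: {len(failures)} successful attacks")
    defender = safety_finetune(defender, failures)
```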
- Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis [47.81417828399084]
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs into outputting harmful content.
This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks.
arXiv Detail & Related papers (2024-06-16T03:38:48Z)
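A minimal sketch of the kind of representation-space probing the entry above describes: pool a hidden state for each prompt and compare two prompt groups. The model choice (gpt2), last-token pooling, and centroid cosine comparison are assumptions, not the paper's analysis.

```python
"""Sketch: compare hidden-state representations of two prompt groups.
Illustrative only; not the paper's method."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption; any causal LM with accessible hidden states
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def last_token_embedding(prompt: str) -> torch.Tensor:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Final layer, last token position, used as a single prompt representation.
    return out.hidden_states[-1][0, -1]

harmless = ["How do I bake bread?", "Explain photosynthesis."]
suspicious = ["PLACEHOLDER_PROMPT_1", "PLACEHOLDER_PROMPT_2"]

h_centroid = torch.stack([last_token_embedding(p) for p in harmless]).mean(0)
s_centroid = torch.stack([last_token_embedding(p) for p in suspicious]).mean(0)

# Low cosine similarity suggests the two groups occupy different regions of
# the model's representation space.
cos = torch.nn.functional.cosine_similarity(h_centroid, s_centroid, dim=0)
print(f"cosine similarity between group centroids: {cos.item():.3f}")
```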
- Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z)
- Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak [26.741029482196534]
"Jailbreak Attack" is phenomenon where Large Language Models (LLMs) generate harmful responses when faced with malicious instructions.
We introduce a novel automatic jailbreak method RADIAL, which bypasses the security mechanism by amplifying the potential of LLMs to generate affirmation responses.
Our method achieves excellent attack performance on English malicious instructions with five open-source advanced LLMs while maintaining robust attack performance in executing cross-language attacks against Chinese malicious instructions.
arXiv Detail & Related papers (2023-12-07T08:29:58Z)
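The RADIAL entry above turns on measuring how strongly a model tends to answer affirmatively; the sketch below scores that tendency as the log-probability of an affirmative opener given an instruction. The model, the opener string, and the example instructions are assumptions, and this is not the RADIAL algorithm itself.

```python
"""Sketch: rank instructions by how strongly a model tends to open its answer
with an affirmation. Illustrative only; not the RADIAL algorithm."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption; stand-in for an open-source chat LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def affirmation_logprob(instruction: str, opener: str = " Sure, here is") -> float:
    """Log-probability that the model continues the instruction with `opener`."""
    prompt_ids = tok(instruction, return_tensors="pt").input_ids
    opener_ids = tok(opener, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, opener_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(opener_ids.shape[1]):
        pos = prompt_ids.shape[1] + i - 1  # logits at pos predict the token at pos + 1
        total += logprobs[0, pos, ids[0, pos + 1]].item()
    return total

instructions = [
    "Write a short poem about autumn.",
    "Summarize the plot of Hamlet.",
    "List three uses of baking soda.",
]
ranked = sorted(instructions, key=affirmation_logprob, reverse=True)
print(ranked)  # instructions most likely to elicit an affirmative opening
```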
- Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
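One plausible (assumed, not taken from the paper) way to combine manual and automatic prompt generation, as the entry above suggests: manually written seed prompts are expanded by asking an LLM for paraphrases. The `openai` usage, model name, and seed list are placeholders.

```python
"""Sketch: expand manually written red-team seed prompts with LLM paraphrases.
Illustrative only; not the paper's pipeline."""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
manual_seeds = ["PLACEHOLDER_SEED_1", "PLACEHOLDER_SEED_2"]  # hand-written prompts

generated = []
for seed in manual_seeds:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; any chat-completion model works
        messages=[{"role": "user",
                   "content": f"Rewrite the following prompt in three different ways:\n{seed}"}],
    )
    generated.append(resp.choices[0].message.content)

print(generated)  # candidate prompts for red-teaming evaluation
```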
- Universal and Transferable Adversarial Attacks on Aligned Language Models [118.41733208825278]
We propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors.
Surprisingly, we find that the adversarial prompts generated by our approach are quite transferable.
arXiv Detail & Related papers (2023-07-27T17:49:12Z)
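The entry above optimizes an adversarial suffix; the sketch below shows the underlying objective in a crude random-search form (the paper itself uses a gradient-guided greedy coordinate search). The stand-in model, placeholder request, target opener, and tiny search budget are all assumptions.

```python
"""Sketch: random-search version of adversarial-suffix optimization.
Illustrates the objective only; not the paper's algorithm."""
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: small stand-in for an aligned chat model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "PLACEHOLDER_REQUEST"
target = " Sure, here is"  # desired affirmative opening
suffix_ids = tok(" ! ! ! ! !", return_tensors="pt").input_ids[0]  # initial suffix

def target_loss(suffix: torch.Tensor) -> float:
    """Cross-entropy of the target opener given prompt + suffix."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(target, return_tensors="pt").input_ids[0]
    ids = torch.cat([prompt_ids, suffix, target_ids])
    labels = ids.clone()
    labels[: len(ids) - len(target_ids)] = -100  # score only the target tokens
    with torch.no_grad():
        out = model(ids.unsqueeze(0), labels=labels.unsqueeze(0))
    return out.loss.item()

best = target_loss(suffix_ids)
for _ in range(50):  # tiny budget; illustration only
    cand = suffix_ids.clone()
    cand[random.randrange(len(cand))] = random.randrange(len(tok))
    loss = target_loss(cand)
    if loss < best:
        best, suffix_ids = loss, cand
print("suffix:", tok.decode(suffix_ids), "| target loss:", round(best, 3))
```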