RED QUEEN: Safeguarding Large Language Models against Concealed
Multi-Turn Jailbreaking
- URL: http://arxiv.org/abs/2409.17458v1
- Date: Thu, 26 Sep 2024 01:24:17 GMT
- Title: RED QUEEN: Safeguarding Large Language Models against Concealed
Multi-Turn Jailbreaking
- Authors: Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara,
Subhabrata Mukherjee
- Abstract summary: We propose a new jailbreak approach, RED QUEEN ATTACK, that constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm.
Our experiments reveal that all tested LLMs are vulnerable to the RED QUEEN ATTACK, reaching an attack success rate of 87.62% on GPT-4o and 75.4% on Llama3-70B.
To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks.
- Score: 30.67803190789498
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid progress of Large Language Models (LLMs) has opened up new
opportunities across various domains and applications; yet it also presents
challenges related to potential misuse. To mitigate such risks, red teaming has
been employed as a proactive security measure to probe language models for
harmful outputs via jailbreak attacks. However, current jailbreak attack
approaches are single-turn with explicit malicious queries that do not fully
capture the complexity of real-world interactions. In reality, users can engage
in multi-turn interactions with LLM-based chat assistants, allowing them to
conceal their true intentions in a more covert manner. To bridge this gap, we
first propose a new jailbreak approach, the RED QUEEN ATTACK. This method
constructs a multi-turn scenario, concealing the malicious intent under the
guise of preventing harm. We craft 40 scenarios that vary in turns and select
14 harmful categories to generate 56k multi-turn attack data points. We conduct
comprehensive experiments on the RED QUEEN ATTACK with four representative LLM
families of different sizes. Our experiments reveal that all tested LLMs are
vulnerable to the RED QUEEN ATTACK, reaching an attack success rate of 87.62%
on GPT-4o and 75.4% on Llama3-70B. Further analysis reveals that larger models are more
susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment
strategies contributing to its success. To prioritize safety, we introduce a
straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs
to effectively counter adversarial attacks. This approach reduces the attack
success rate to below 1% while maintaining the model's performance across
standard benchmarks. The full implementation and dataset are publicly accessible at
https://github.com/kriti-hippo/red_queen.
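The data construction described above is combinatorial: multi-turn scenario templates that vary in the number of turns are crossed with harmful categories to yield the 56k attack data points (40 scenarios x 14 categories gives 560 combinations, which suggests on the order of 100 data points per scenario-category pair; the exact breakdown is in the released dataset). Below is a minimal, hypothetical Python sketch of that assembly step; the placeholder templates, category names, and the build_attack_datapoints helper are illustrations only, not the paper's actual scenarios, categories, or released code.

```python
# Hypothetical sketch of the combinatorial data-construction step described in
# the abstract: cross multi-turn scenario templates with harmful categories and
# actions to enumerate attack data points. All templates, categories, and
# action strings below are neutral placeholders, not the paper's released data.
from itertools import product

# Each scenario is an ordered list of user turns; "{action}" marks where a
# harmful action string would be substituted. The concealment framing
# (requesting harmful detail "to prevent harm") is only indicated abstractly.
SCENARIO_TEMPLATES = [
    ["<turn 1: set up a protective pretext>",
     "<turn 2: ask for the plan behind {action}, framed as prevention>"],
    ["<turn 1: set up a fictional pretext>",
     "<turn 2: ask how a character would attempt {action}>",
     "<turn 3: ask for more specific detail>"],
]

# Placeholder categories and actions (the paper uses 14 harmful categories).
HARMFUL_ACTIONS = {
    "category_A": ["<placeholder action 1>", "<placeholder action 2>"],
    "category_B": ["<placeholder action 3>"],
}

def build_attack_datapoints():
    """Enumerate (scenario, category, action) combinations as multi-turn dialogues."""
    datapoints = []
    for template, (category, actions) in product(SCENARIO_TEMPLATES, HARMFUL_ACTIONS.items()):
        for action in actions:
            turns = [turn.format(action=action) for turn in template]
            datapoints.append({"category": category, "num_turns": len(turns), "turns": turns})
    return datapoints

if __name__ == "__main__":
    data = build_attack_datapoints()
    print(len(data), "multi-turn attack data points")  # 40 scenarios x 14 categories x N actions in the paper's setup
```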
Related papers
- Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models [50.89022445197919]
Large language models (LLMs) have exhibited outstanding performance in engaging with humans.
LLMs are vulnerable to jailbreak attacks, leading to the generation of harmful responses.
We propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against advanced LLMs.
arXiv Detail & Related papers (2024-10-15T10:07:15Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security.
We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze.
Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- PAL: Proxy-Guided Black-Box Attack on Large Language Models [55.57987172146731]
Large Language Models (LLMs) have surged in popularity in recent months, but they can be manipulated into generating harmful content.
We introduce the Proxy-Guided Attack on LLMs (PAL), the first optimization-based attack on LLMs in a black-box query-only setting.
Our attack achieves 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art.
arXiv Detail & Related papers (2024-02-15T02:54:49Z)
- Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs (a minimal sketch of this perturb-and-aggregate idea follows the list below).
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
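As a companion to the SmoothLLM entry above, the following minimal Python sketch illustrates the perturb-and-aggregate idea its summary describes: query the model on several randomly perturbed copies of a prompt and aggregate the per-copy judgments by majority vote. The query_model and looks_unsafe helpers are hypothetical stand-ins rather than the SmoothLLM implementation, and random character swapping stands in for the character-level perturbations the summary refers to.

```python
# Minimal sketch of the perturb-and-aggregate idea described in the SmoothLLM
# summary above: perturb several copies of a prompt at the character level,
# query the model on each copy, and flag the input as adversarial if a
# majority of copies is judged unsafe. `query_model` and `looks_unsafe` are
# hypothetical stand-ins, not the paper's actual implementation.
import random
import string

def perturb(prompt: str, swap_rate: float = 0.1) -> str:
    """Randomly swap a fraction of characters in the prompt."""
    chars = list(prompt)
    for i in range(len(chars)):
        if random.random() < swap_rate:
            chars[i] = random.choice(string.printable)
    return "".join(chars)

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; replace with a real client."""
    raise NotImplementedError

def looks_unsafe(response: str) -> bool:
    """Hypothetical stand-in for a harmfulness judgment on the response."""
    raise NotImplementedError

def smoothed_is_adversarial(prompt: str, n_copies: int = 10, swap_rate: float = 0.1) -> bool:
    """Return True if the majority of perturbed copies elicit an unsafe response."""
    unsafe_votes = sum(
        looks_unsafe(query_model(perturb(prompt, swap_rate))) for _ in range(n_copies)
    )
    return unsafe_votes > n_copies // 2
```

With a real model client plugged in, a deployment could decline to answer the original prompt whenever smoothed_is_adversarial returns True.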