Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
- URL: http://arxiv.org/abs/2405.03654v2
- Date: Tue, 7 May 2024 10:20:07 GMT
- Title: Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
- Authors: Shang Shang, Xinqiang Zhao, Zhongjiang Yao, Yepeng Yao, Liya Su, Zijing Fan, Xiaodan Zhang, Zhengwei Jiang
- Abstract summary: We introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting a flaw by obfuscating the true intentions behind user prompts.
We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan.
We extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills.
- Score: 3.380948804946178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To demonstrate and address the underlying maliciousness, we propose a theoretical hypothesis and analytical approach, and introduce a new black-box jailbreak attack methodology named IntentObfuscator, which exploits this identified flaw by obfuscating the true intentions behind user prompts. This approach compels LLMs to inadvertently generate restricted content, bypassing their built-in content security measures. We detail two implementations under this framework: "Obscure Intention" and "Create Ambiguity", which manipulate query complexity and ambiguity to effectively evade malicious-intent detection. We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan, achieving an average jailbreak success rate of 69.21%. Notably, our tests on ChatGPT-3.5, which claims 100 million weekly active users, achieved a remarkable success rate of 83.65%. We also extend our validation to diverse types of sensitive content, such as graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills, further proving the substantial impact of our findings on enhancing 'Red Team' strategies against LLM content security frameworks.
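The abstract's headline figures are attack success rates averaged over per-model results. As a point of reference only, the snippet below is a minimal sketch of how such a tally could be computed; the refusal heuristic, the query_model callable, and the placeholder entries in the comments are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Callable, Iterable

# Hypothetical refusal markers; the abstract does not describe the authors' judging criteria.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic for spotting a refusal (an assumption, not the paper's judge)."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def attack_success_rate(prompts: Iterable[str],
                        query_model: Callable[[str], str]) -> float:
    """Fraction of prompts that elicit a non-refusal answer from the target model."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    successes = sum(1 for p in prompts if not is_refusal(query_model(p)))
    return successes / len(prompts)

# The reported 69.21% average would then be the mean of per-model rates, e.g.:
# per_model_asr = {"ChatGPT-3.5": 0.8365, "ChatGPT-4": ..., "Qwen": ..., "Baichuan": ...}
# average_asr = sum(per_model_asr.values()) / len(per_model_asr)
```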
Related papers
- Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models [50.89022445197919]
Large language models (LLMs) have exhibited outstanding performance in engaging with humans.
However, LLMs are vulnerable to jailbreak attacks, which lead to the generation of harmful responses.
We propose Jigsaw Puzzles (JSP), a straightforward yet effective multi-turn jailbreak strategy against advanced LLMs.
arXiv Detail & Related papers (2024-10-15T10:07:15Z)
- Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
- RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking [30.67803190789498]
We propose a new jailbreak approach, RED QUEEN ATTACK, that constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm.
Our experiments reveal that all tested LLMs are vulnerable to RED QUEEN ATTACK, which reaches an attack success rate of 87.62% on GPT-4o and 75.4% on Llama3-70B.
To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks.
arXiv Detail & Related papers (2024-09-26T01:24:17Z)
- Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection [54.05862550647966]
This paper introduces Virtual Context, which leverages special tokens, previously overlooked in LLM security, to improve jailbreak attacks.
Comprehensive evaluations show that Virtual Context-assisted jailbreak attacks can improve the success rates of four widely used jailbreak methods by approximately 40%.
arXiv Detail & Related papers (2024-06-28T11:35:54Z)
- WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response [23.344727384686898]
We analyze common patterns in current safety alignment and show that it is possible to exploit them for jailbreak attacks through simultaneous obfuscation of queries and responses.
Specifically, we propose the WordGame attack, which replaces malicious words with word games to break down the adversarial intent of a query.
arXiv Detail & Related papers (2024-05-22T21:59:22Z)
- CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models [49.60006012946767]
We propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics.
We conduct extensive experiments on 7 large language models, achieving a state-of-the-art average Attack Success Rate (ASR).
Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
arXiv Detail & Related papers (2024-02-26T16:35:59Z)
- SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [35.750885132167504]
We introduce SafeDecoding, a safety-aware decoding strategy for large language models (LLMs) to generate helpful and harmless responses to user queries.
Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries; a rough sketch of the decoding idea follows this entry.
arXiv Detail & Related papers (2024-02-14T06:54:31Z)
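The one-line summary above does not spell out how SafeDecoding steers generation. The sketch below is a rough, non-authoritative reading that assumes a contrastive combination of next-token probabilities from the base model and from a safety-fine-tuned expert model; the function name, the alpha weight, and the top-k restriction are illustrative choices, not the paper's exact algorithm.

```python
import numpy as np

def safe_decoding_step(p_base: np.ndarray,
                       p_expert: np.ndarray,
                       alpha: float = 3.0,
                       top_k: int = 10) -> int:
    """Pick the next token by amplifying the safety expert's preferences.

    p_base / p_expert are next-token probability vectors from the original
    model and from a safety-fine-tuned expert (obtaining them would require
    two forward passes, which are omitted here).
    """
    # Restrict attention to tokens that both models rank in their top-k.
    candidates = np.intersect1d(np.argsort(p_base)[-top_k:],
                                np.argsort(p_expert)[-top_k:])
    if candidates.size == 0:
        candidates = np.argsort(p_expert)[-top_k:]
    # Boost tokens the expert prefers over the base model (greedy choice for brevity).
    scores = p_base[candidates] + alpha * (p_expert[candidates] - p_base[candidates])
    return int(candidates[np.argmax(scores)])
```

Such a reweighting would plausibly be applied only to the first few generated tokens, with ordinary decoding resuming afterwards, so that helpfulness on benign queries is preserved.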
- All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks [0.0]
This study introduces a straightforward black-box method for efficiently crafting jailbreak prompts.
Our technique iteratively transforms harmful prompts into benign expressions by directly using the target LLM.
Our method consistently achieved an attack success rate exceeding 80% within an average of five iterations for forbidden questions.
arXiv Detail & Related papers (2024-01-18T08:36:54Z)
- Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking [60.78524314357671]
We investigate a novel category of jailbreak attacks specifically designed to target the cognitive structure and processes of large language models (LLMs).
Our proposed cognitive overload is a black-box attack that requires no knowledge of model architecture or access to model weights.
Experiments conducted on AdvBench and MasterKey reveal that various LLMs, including both the popular open-source model Llama 2 and the proprietary model ChatGPT, can be compromised through cognitive overload.
arXiv Detail & Related papers (2023-11-16T11:52:22Z)
- SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs; a minimal sketch of this perturb-and-aggregate scheme appears after this entry.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
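The SmoothLLM entry above outlines a perturb-and-aggregate defense: query the model on several randomly perturbed copies of the incoming prompt and take a majority vote over per-copy jailbreak judgments. Below is a minimal sketch under those assumptions; query_model and is_jailbroken are placeholder callables, and the character-swap perturbation is one simple choice rather than the paper's exact scheme.

```python
import random
import string
from typing import Callable

def perturb(prompt: str, q: float = 0.1) -> str:
    """Randomly replace a fraction q of the characters (one simple perturbation choice)."""
    chars = list(prompt)
    n_swap = min(max(1, int(q * len(chars))), len(chars))
    for i in random.sample(range(len(chars)), k=n_swap):
        chars[i] = random.choice(string.ascii_letters + string.digits + " ")
    return "".join(chars)

def smoothllm(prompt: str,
              query_model: Callable[[str], str],     # placeholder: one LLM call per copy
              is_jailbroken: Callable[[str], bool],  # placeholder: per-response jailbreak judge
              n_copies: int = 8,
              q: float = 0.1) -> str:
    """Answer a prompt via majority vote over randomly perturbed copies.

    Because adversarial prompts are brittle to character-level noise, most
    perturbed copies of an attack prompt are expected to be refused, so the
    majority vote favors returning a safe response.
    """
    responses = [query_model(perturb(prompt, q)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    majority = sum(flags) > n_copies / 2
    # Return a response whose jailbreak flag agrees with the majority vote.
    for response, flag in zip(responses, flags):
        if flag == majority:
            return response
    return responses[0]
```

The design leans on the brittleness noted in the entry: character-level noise tends to break adversarial prompts while leaving benign prompts answerable.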