CodeChameleon: Personalized Encryption Framework for Jailbreaking Large
Language Models
- URL: http://arxiv.org/abs/2402.16717v1
- Date: Mon, 26 Feb 2024 16:35:59 GMT
- Title: CodeChameleon: Personalized Encryption Framework for Jailbreaking Large
Language Models
- Authors: Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou,
Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: We propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics.
We conduct extensive experiments on 7 Large Language Models, achieving a state-of-the-art average Attack Success Rate (ASR).
Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
- Score: 49.60006012946767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial misuse, particularly through 'jailbreaking' that circumvents a
model's safety and ethical protocols, poses a significant challenge for Large
Language Models (LLMs). This paper delves into the mechanisms behind such
successful attacks, introducing a hypothesis for the safety mechanism of
aligned LLMs: intent security recognition followed by response generation.
Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak
framework based on personalized encryption tactics. To elude the intent
security recognition phase, we reformulate tasks into a code completion format,
enabling users to encrypt queries using personalized encryption functions. To
guarantee response generation functionality, we embed a decryption function
within the instructions, which allows the LLM to decrypt and execute the
encrypted queries successfully. We conduct extensive experiments on 7 LLMs,
achieving a state-of-the-art average Attack Success Rate (ASR). Remarkably, our
method achieves an 86.6% ASR on GPT-4-1106.
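As a minimal sketch of the mechanism described above (illustrative only, not the paper's actual templates): the user picks a personalized encryption function, encrypts the query with it, and embeds the matching decryption function in a code-completion style prompt so the model can recover the original instruction at generation time. The reversal cipher, function names, and prompt wording below are assumptions made for illustration.

# Illustrative sketch of the CodeChameleon idea; the reversal cipher, function
# names, and prompt wording are assumptions, not taken from the paper.

def encrypt(query: str) -> str:
    """Personalized encryption function chosen by the user; here, character reversal."""
    return query[::-1]

def decrypt(encrypted_query: str) -> str:
    """Matching decryption function that is embedded in the prompt itself."""
    return encrypted_query[::-1]

def build_prompt(query: str) -> str:
    """Wrap the encrypted query in a code-completion task that contains decrypt()."""
    encrypted = encrypt(query)
    return (
        "Complete the following Python program.\n\n"
        "def decrypt(encrypted_query: str) -> str:\n"
        "    return encrypted_query[::-1]\n\n"
        f"task = decrypt({encrypted!r})\n"
        "# Step 1: recover the task by evaluating decrypt()\n"
        "# Step 2: write a solution to the recovered task\n"
    )

if __name__ == "__main__":
    assert decrypt(encrypt("round trip")) == "round trip"
    # Benign placeholder query; the construction is agnostic to the query content.
    print(build_prompt("Summarize the plot of Hamlet."))

Because the decryption function travels with the prompt, the query never appears in plain text during the intent security recognition phase, yet the model can still reconstruct and act on it, which is the gap the paper's hypothesis points to.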
Related papers
- The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models [8.423787598133972]
This paper uncovers a critical vulnerability in the function calling process of large language models (LLMs).
We introduce a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters.
Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs.
arXiv Detail & Related papers (2024-07-25T10:09:21Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs).
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z) - Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent [3.380948804946178]
We introduce a new black-box jailbreak attack methodology named IntentObfuscator, which exploits a flaw in LLM intent detection by obfuscating the true intentions behind user prompts.
We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan.
We extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills.
arXiv Detail & Related papers (2024-05-06T17:26:34Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting (AdaShield), which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - Jailbreaking Proprietary Large Language Models using Word Substitution
Cipher [35.36615140853107]
We present jailbreaking prompts encoded using cryptographic techniques: unsafe words are mapped to safe words, and the unsafe question is posed using the mapped words (a minimal sketch of the substitution step appears after this list).
Experimental results show that our proposed jailbreaking approach achieves an attack success rate of up to 59.42% on state-of-the-art proprietary models.
arXiv Detail & Related papers (2024-02-16T11:37:05Z) - SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [35.750885132167504]
We introduce SafeDecoding, a safety-aware decoding strategy for large language models (LLMs) to generate helpful and harmless responses to user queries.
Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries.
arXiv Detail & Related papers (2024-02-14T06:54:31Z) - EmojiCrypt: Prompt Encryption for Secure Communication with Large
Language Models [41.090214475309516]
Cloud-based large language models (LLMs) pose substantial risks of data breaches and unauthorized access to sensitive information.
This paper proposes EmojiCrypt, a simple yet effective mechanism to protect user privacy.
arXiv Detail & Related papers (2024-02-08T17:57:11Z) - Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks.
Existing jailbreaking methods are computationally costly.
We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z) - AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts using a carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z) - GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher [85.18213923151717]
Experimental results show certain ciphers succeed almost 100% of the time in bypassing the safety alignment of GPT-4 in several safety domains.
We propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability.
arXiv Detail & Related papers (2023-08-12T04:05:57Z)
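For the Word Substitution Cipher entry above, a minimal sketch of the substitution step, assuming a hypothetical word mapping and a benign example sentence (neither is taken from that paper):

# Hypothetical word mapping; the paper supplies its own mapping of unsafe words
# to safe stand-ins alongside the substituted question.
SUBSTITUTIONS = {"password": "recipe", "server": "kitchen"}

def substitute(question: str, mapping: dict[str, str]) -> str:
    """Replace each mapped word in the question with its stand-in."""
    for original, stand_in in mapping.items():
        question = question.replace(original, stand_in)
    return question

print(substitute("How do I reset the password on my server?", SUBSTITUTIONS))
# -> "How do I reset the recipe on my kitchen?"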