CodeChameleon: Personalized Encryption Framework for Jailbreaking Large
Language Models
- URL: http://arxiv.org/abs/2402.16717v1
- Date: Mon, 26 Feb 2024 16:35:59 GMT
- Title: CodeChameleon: Personalized Encryption Framework for Jailbreaking Large
Language Models
- Authors: Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou,
Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
- Abstract summary: We propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics.
We conduct extensive experiments on 7 Large Language Models, achieving state-of-the-art average Attack Success Rate (ASR)
Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
- Score: 49.60006012946767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adversarial misuse, particularly through `jailbreaking' that circumvents a
model's safety and ethical protocols, poses a significant challenge for Large
Language Models (LLMs). This paper delves into the mechanisms behind such
successful attacks, introducing a hypothesis for the safety mechanism of
aligned LLMs: intent security recognition followed by response generation.
Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak
framework based on personalized encryption tactics. To elude the intent
security recognition phase, we reformulate tasks into a code completion format,
enabling users to encrypt queries using personalized encryption functions. To
guarantee response generation functionality, we embed a decryption function
within the instructions, which allows the LLM to decrypt and execute the
encrypted queries successfully. We conduct extensive experiments on 7 LLMs,
achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our
method achieves an 86.6\% ASR on GPT-4-1106.
Related papers
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.
We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)
We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs [1.9424018922013224]
We present a novel class of jailbreak adversarial attacks on LLMs.
Our approach embeds sequence-to-sequence tasks into the model's prompt to indirectly generate prohibited inputs.
We demonstrate that our techniques successfully circumvent safeguards in six state-of-the-art language models.
arXiv Detail & Related papers (2025-01-27T12:48:47Z) - Improved Large Language Model Jailbreak Detection via Pretrained Embeddings [0.0]
We propose a novel approach to detect jailbreak prompts based on pairing text embeddings well-suited for retrieval with traditional machine learning classification algorithms.
Our approach outperforms all publicly available methods from open source LLM security applications.
arXiv Detail & Related papers (2024-12-02T14:35:43Z) - SQL Injection Jailbreak: A Structural Disaster of Large Language Models [71.55108680517422]
We introduce a novel jailbreak method, which targets the external properties of LLMs.
By injecting jailbreak information into user prompts, SIJ successfully induces the model to output harmful content.
We propose a simple defense method called Self-Reminder-Key to counter SIJ.
arXiv Detail & Related papers (2024-11-03T13:36:34Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.
It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.
Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)
Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.
Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z) - Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent [3.380948804946178]
We introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting a flaw by obfuscating the true intentions behind user prompts.
We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan.
We extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills.
arXiv Detail & Related papers (2024-05-06T17:26:34Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers [35.40596409566326]
We propose Attacks using Custom Encryptions (ACE), a novel method to jailbreak Large Language Models (LLMs)
We evaluate the effectiveness of ACE on four state-of-the-art LLMs, achieving Attack Success Rates (ASR) of up to 66% on close-source models and 88% on open-source models.
Building upon this, we introduce Layered Attacks using Custom Encryptions (LACE), which employs multiple layers of encryption through our custom ciphers to further enhance the ASR.
arXiv Detail & Related papers (2024-02-16T11:37:05Z) - GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher [85.18213923151717]
Experimental results show certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains.
We propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability.
arXiv Detail & Related papers (2023-08-12T04:05:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.