Related papers: CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

URL: http://arxiv.org/abs/2402.16717v1
Date: Mon, 26 Feb 2024 16:35:59 GMT
Title: CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
Authors: Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract summary: We propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. We conduct extensive experiments on 7 Large Language Models, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
Score: 49.60006012946767
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Adversarial misuse, particularly through 'jailbreaking' that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.

Related papers

Cryptanalysis of LC-MUME: A Lightweight Certificateless Multi-User Matchmaking Encryption for Mobile Devices [0.0]
We show that a Type-I adversary can successfully forge a valid ciphertext without possessing the complete private key of the sender. We propose a strategy to strengthen the security of matchmaking encryption schemes in mobile computing environments.
arXiv Detail & Related papers (2025-07-30T13:36:52Z)
MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs [14.530593083777502]
Obfuscation-based jailbreak attacks remain highly effective. We propose MetaCipher, a novel obfuscation-based jailbreak framework. Within as few as 10 queries, MetaCipher achieves over 92% attack success rate.
arXiv Detail & Related papers (2025-06-27T18:15:56Z)
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges [70.85114705489222]
We propose MalwareBench, a benchmark dataset containing 3,520 jailbreaking prompts for malicious code-generation. MalwareBench is based on 320 manually crafted malicious code generation requirements, covering 11 jailbreak methods and 29 code functionality categories. Experiments show that mainstream LLMs exhibit limited ability to reject malicious code-generation requirements, and the combination of multiple jailbreak methods further reduces the model's security capabilities.
arXiv Detail & Related papers (2025-06-09T12:02:39Z)
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content. Adversarial attacks, commonly referred to as jailbreaks, can bypass their safety protocols. We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z)
Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts [5.718926328180089]
This paper introduces a jailbreaking approach, CodeJailbreaker, designed to uncover safety concerns in code generation. Experiments on the recently-released RMCBench benchmark demonstrate that CodeJailbreaker markedly surpasses the conventional jailbreaking strategy.
arXiv Detail & Related papers (2025-03-23T06:06:12Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models. We propose a novel black-box jailbreak method leveraging reinforcement learning (RL). We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts. It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks. Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles [10.109063166962079]
This paper proposes a new type of jailbreak attack that shifts the attention of Large Language Models (LLMs). The proposed attack leverages the knowledge graph and a composer LLM to automatically generate a carrier article that is similar to the topic of a prohibited query. Our experiment results show that the proposed attacking method can successfully jailbreak all the target LLMs with a high success rate, except for Claude-3.
arXiv Detail & Related papers (2024-08-20T20:35:04Z)
h4rm3l: A language for Composable Jailbreak Attack Synthesis [48.5611060845958]
h4rm3l is a novel approach that addresses the gap with a human-readable domain-specific language. We show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature.
arXiv Detail & Related papers (2024-08-09T01:45:39Z)
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models [8.423787598133972]
This paper uncovers a critical vulnerability in the function calling process of large language models (LLMs). We introduce a novel "jailbreak function" attack method that exploits alignment discrepancies, user coercion, and the absence of rigorous safety filters. Our findings highlight the urgent need for enhanced security measures in the function calling capabilities of LLMs.
arXiv Detail & Related papers (2024-07-25T10:09:21Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs). Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent [3.380948804946178]
We introduce a new black-box jailbreak attack methodology named IntentObfuscator, exploiting a flaw by obfuscating the true intentions behind user prompts. We empirically validate the effectiveness of the IntentObfuscator method across several models, including ChatGPT-3.5, ChatGPT-4, Qwen and Baichuan. We extend our validation to diverse types of sensitive content like graphic violence, racism, sexism, political sensitivity, cybersecurity threats, and criminal skills.
arXiv Detail & Related papers (2024-05-06T17:26:34Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose Adaptive Shield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers [35.40596409566326]
We propose Attacks using Custom Encryptions (ACE), a novel method to jailbreak Large Language Models (LLMs). We evaluate the effectiveness of ACE on four state-of-the-art LLMs, achieving Attack Success Rates (ASR) of up to 66% on closed-source models and 88% on open-source models. Building upon this, we introduce Layered Attacks using Custom Encryptions (LACE), which employs multiple layers of encryption through our custom ciphers to further enhance the ASR.
arXiv Detail & Related papers (2024-02-16T11:37:05Z)
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding [35.750885132167504]
We introduce SafeDecoding, a safety-aware decoding strategy for large language models (LLMs) to generate helpful and harmless responses to user queries. Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries.
arXiv Detail & Related papers (2024-02-14T06:54:31Z)
Weak-to-Strong Jailbreaking on Large Language Models [96.50953637783581]
Large language models (LLMs) are vulnerable to jailbreak attacks. Existing jailbreaking methods are computationally costly. We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher [85.18213923151717]
Experimental results show certain ciphers succeed almost 100% of the time to bypass the safety alignment of GPT-4 in several safety domains. We propose a novel SelfCipher that uses only role play and several demonstrations in natural language to evoke this capability.
arXiv Detail & Related papers (2023-08-12T04:05:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.