Related papers: MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs

MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs

URL: http://arxiv.org/abs/2506.22557v2
Date: Wed, 13 Aug 2025 10:28:17 GMT
Title: MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs
Authors: Boyuan Chen, Minghao Shao, Abdul Basit, Siddharth Garg, Muhammad Shafique,
Abstract summary: We propose MetaCipher, a low-cost, multi-agent jailbreak framework.<n>Within as few as 10 queries, MetaCipher achieves state-of-the-art attack success rates on recent malicious prompt benchmarks.
Score: 14.530593083777502
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As large language models (LLMs) grow more capable, they face growing vulnerability to sophisticated jailbreak attacks. While developers invest heavily in alignment finetuning and safety guardrails, researchers continue publishing novel attacks, driving progress through adversarial iteration. This dynamic mirrors a strategic game of continual evolution. However, two major challenges hinder jailbreak development: the high cost of querying top-tier LLMs and the short lifespan of effective attacks due to frequent safety updates. These factors limit cost-efficiency and practical impact of research in jailbreak attacks. To address this, we propose MetaCipher, a low-cost, multi-agent jailbreak framework that generalizes across LLMs with varying safety measures. Using reinforcement learning, MetaCipher is modular and adaptive, supporting extensibility to future strategies. Within as few as 10 queries, MetaCipher achieves state-of-the-art attack success rates on recent malicious prompt benchmarks, outperforming prior jailbreak methods. We conduct a large-scale empirical evaluation across diverse victim models and benchmarks, demonstrating its robustness and adaptability. Warning: This paper contains model outputs that may be offensive or harmful, shown solely to demonstrate jailbreak efficacy.

Related papers

Proactive defense against LLM Jailbreak [28.249786308207046]
ProAct is a novel proactive defense framework designed to disrupt and mislead autonomous jailbreaking processes.<n>Our method consistently and significantly reduces attack success rates by up to 92%.
arXiv Detail & Related papers (2025-10-06T17:32:40Z)
Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization [4.020376901658977]
Large Language Models (LLMs) have transformed natural language understanding and generation.<n> cryptanalysis a critical area for data security and encryption has not yet been thoroughly explored in LLM evaluations.<n>We evaluate cryptanalytic potential of state of the art LLMs on encrypted texts generated using a range of cryptographic algorithms.
arXiv Detail & Related papers (2025-05-30T14:12:07Z)
Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models [80.66766532477973]
Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.<n>Test-time IMmunization (TIM) can adaptively defend against various jailbreak attacks in a self-evolving way.
arXiv Detail & Related papers (2025-05-28T11:57:46Z)
Lifelong Safety Alignment for Language Models [33.90238075760236]
We propose a lifelong safety alignment framework for jailbreaking defenses.<n>A Meta-Attacker is trained to actively discover novel jailbreaking strategies, and a Defender is trained to resist them.<n>Our framework reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs.
arXiv Detail & Related papers (2025-05-26T17:40:40Z)
Prefill-Based Jailbreak: A Novel Approach of Bypassing LLM Safety Boundary [2.4329261266984346]
Large Language Models (LLMs) are designed to generate helpful and safe content.<n> adversarial attacks, commonly referred to as jailbreak, can bypass their safety protocols.<n>We introduce a novel jailbreak attack method that leverages the prefilling feature of LLMs.
arXiv Detail & Related papers (2025-04-28T07:38:43Z)
xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.<n>We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)<n>We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z)
Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense [55.77152277982117]
We introduce Layer-AdvPatcher, a methodology designed to defend against jailbreak attacks.<n>We use an unlearning strategy to patch specific layers within large language models through self-augmented datasets.<n>Our framework reduces the harmfulness and attack success rate of jailbreak attacks.
arXiv Detail & Related papers (2025-01-05T19:06:03Z)
Divide and Conquer: A Hybrid Strategy Defeats Multimodal Large Language Models [1.0768089718297513]
This paper proposes a multimodal jailbreaking method: JMLLM.<n>It integrates multiple strategies to perform comprehensive jailbreak attacks across text, visual, and auditory modalities.<n>We also contribute a new and comprehensive dataset for multimodal jailbreaking research: TriJail.
arXiv Detail & Related papers (2024-12-21T09:43:51Z)
Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z)
h4rm3l: A language for Composable Jailbreak Attack Synthesis [48.5611060845958]
h4rm3l is a novel approach that addresses the gap with a human-readable domain-specific language.<n>We show that h4rm3l's synthesized attacks are diverse and more successful than existing jailbreak attacks in literature.
arXiv Detail & Related papers (2024-08-09T01:45:39Z)
EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications. LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations. We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs)<n>Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs.<n> Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models [49.60006012946767]
We propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. We conduct extensive experiments on 7 Large Language Models, achieving state-of-the-art average Attack Success Rate (ASR) Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
arXiv Detail & Related papers (2024-02-26T16:35:59Z)
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings [58.82536530615557]
We propose an Adversarial Suffix Embedding Translation Framework (ASETF) to transform continuous adversarial suffix embeddings into coherent and understandable text. Our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate to existing techniques.
arXiv Detail & Related papers (2024-02-25T06:46:27Z)
When "Competency" in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers [33.41396323391102]
We introduce Attacks using Custom Encryptions (ACE), a jailbreaking technique that encodes malicious queries with novel ciphers.<n>We also introduce Layered Attacks using Custom Encryptions (LACE), which applies multi-layer ciphers to amplify attack complexity.<n>Our experiments reveal a critical trade-off: LLMs that are more capable of decoding ciphers are more vulnerable to these jailbreaking attacks.
arXiv Detail & Related papers (2024-02-16T11:37:05Z)
Weak-to-Strong Jailbreaking on Large Language Models [92.52448762164926]
Large language models (LLMs) are vulnerable to jailbreak attacks.<n>Existing jailbreaking methods are computationally costly.<n>We propose the weak-to-strong jailbreaking attack.
arXiv Detail & Related papers (2024-01-30T18:48:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.