Related papers: MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies

MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies

URL: http://arxiv.org/abs/2508.13048v1
Date: Mon, 18 Aug 2025 16:09:57 GMT
Title: MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies
Authors: Weiwei Qi, Shuo Shao, Wei Gu, Tianhang Zheng, Puning Zhao, Zhan Qin, Kui Ren,
Abstract summary: Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks.<n>We propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse innovative disguise strategies.<n>Our empirical results demonstrate that MAJIC significantly outperforms existing jailbreak methods on prominent models such as GPT-4o and Gemini-2.0-flash.
Score: 27.162196792311263
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have exhibited remarkable capabilities but remain vulnerable to jailbreaking attacks, which can elicit harmful content from the models by manipulating the input prompts. Existing black-box jailbreaking techniques primarily rely on static prompts crafted with a single, non-adaptive strategy, or employ rigid combinations of several underperforming attack methods, which limits their adaptability and generalization. To address these limitations, we propose MAJIC, a Markovian adaptive jailbreaking framework that attacks black-box LLMs by iteratively combining diverse innovative disguise strategies. MAJIC first establishes a ``Disguise Strategy Pool'' by refining existing strategies and introducing several innovative approaches. To further improve the attack performance and efficiency, MAJIC formulate the sequential selection and fusion of strategies in the pool as a Markov chain. Under this formulation, MAJIC initializes and employs a Markov matrix to guide the strategy composition, where transition probabilities between strategies are dynamically adapted based on attack outcomes, thereby enabling MAJIC to learn and discover effective attack pathways tailored to the target model. Our empirical results demonstrate that MAJIC significantly outperforms existing jailbreak methods on prominent models such as GPT-4o and Gemini-2.0-flash, achieving over 90\% attack success rate with fewer than 15 queries per attempt on average.

Related papers

Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization [51.12422886183246]
Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks.<n>Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts.<n>We propose ACE-Safety, a novel framework that jointly optimize attack and defense models by seamlessly integrating two key innovative procedures.
arXiv Detail & Related papers (2025-11-24T15:23:41Z)
An Automated Framework for Strategy Discovery, Retrieval, and Evolution in LLM Jailbreak Attacks [9.715575204912167]
We propose a jailbreak framework that autonomously discovers, retrieves, and evolves attack strategies.<n>ASTRA achieves an average Attack Success Rate (ASR) of 82.7%, significantly outperforming baselines.
arXiv Detail & Related papers (2025-11-04T08:24:22Z)
SMaRT: Select, Mix, and ReinvenT - A Strategy Fusion Framework for LLM-Driven Reasoning and Planning [14.78684546475325]
Large Language Models (LLMs) have redefined complex task automation with exceptional generalization capabilities.<n>No single strategy excels universally, highlighting the need for frameworks that fuse strategies to maximize performance and ensure robustness.<n>We introduce the Select, Mix, and ReinvenT (SMaRT) framework, an innovative strategy fusion approach designed to overcome this constraint.
arXiv Detail & Related papers (2025-10-20T20:42:24Z)
MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models [5.645247459469767]
We propose a capability-aware Multi-Encryption Framework (MEF) for evaluating vulnerabilities in black-box LLMs.<n>For models with limited comprehension ability, MEF adopts the Fu+En1 strategy, which integrates layered semantic mutations with an encryption technique.<n>For models with strong comprehension ability, MEF uses a more complex Fu+En1+En2 strategy, in which additional dual-ended encryption techniques are applied to the LLM's responses.
arXiv Detail & Related papers (2025-05-29T12:50:57Z)
Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks [17.75247947379804]
We present the first adversarial training paradigm tailored to defend against jailbreak attacks during the MLLM training phase.<n>We introduce Projection Layer Against Adversarial Training (ProEAT), an end-to-end AT framework.<n>ProEAT achieves state-of-the-art defense performance, outperforming existing baselines by an average margin of +34% across text and image modalities.
arXiv Detail & Related papers (2025-03-05T14:13:35Z)
Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309]
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired in transfer-based attacks. We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce the output that exactly matches the target string on AdvBench.
arXiv Detail & Related papers (2024-05-28T06:10:12Z)
Mutual-modality Adversarial Attack with Semantic Perturbation [81.66172089175346]
We propose a novel approach that generates adversarial attacks in a mutual-modality optimization scheme. Our approach outperforms state-of-the-art attack methods and can be readily deployed as a plug-and-play solution.
arXiv Detail & Related papers (2023-12-20T05:06:01Z)
LAS-AT: Adversarial Training with Learnable Attack Strategy [82.88724890186094]
"Learnable attack strategy", dubbed LAS-AT, learns to automatically produce attack strategies to improve the model robustness. Our framework is composed of a target network that uses AEs for training to improve robustness and a strategy network that produces attack strategies to control the AE generation.
arXiv Detail & Related papers (2022-03-13T10:21:26Z)
Projective Ranking-based GNN Evasion Attacks [52.85890533994233]
Graph neural networks (GNNs) offer promising learning methods for graph-related tasks. GNNs are at risk of adversarial attacks.
arXiv Detail & Related papers (2022-02-25T21:52:09Z)
Boosting Transferability of Targeted Adversarial Examples via Hierarchical Generative Networks [56.96241557830253]
Transfer-based adversarial attacks can effectively evaluate model robustness in the black-box setting. We propose a conditional generative attacking model, which can generate the adversarial examples targeted at different classes. Our method improves the success rates of targeted black-box attacks by a significant margin over the existing methods.
arXiv Detail & Related papers (2021-07-05T06:17:47Z)
Effective Unsupervised Domain Adaptation with Adversarially Trained Language Models [54.569004548170824]
We show that careful masking strategies can bridge the knowledge gap of masked language models. We propose an effective training strategy by adversarially masking out those tokens which are harder to adversarial by the underlying.
arXiv Detail & Related papers (2020-10-05T01:49:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.