Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning
- URL: http://arxiv.org/abs/2509.23558v1
- Date: Sun, 28 Sep 2025 01:38:00 GMT
- Title: Formalization Driven LLM Prompt Jailbreaking via Reinforcement Learning
- Authors: Zhaoqi Wang, Daqing He, Zijian Zhang, Xin Li, Liehuang Zhu, Meng Li, Jiamou Liu,
- Abstract summary: PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions.<n>We conduct extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
- Score: 48.100552417137656
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, yet they also introduce novel security challenges. For instance, prompt jailbreaking attacks involve adversaries crafting sophisticated prompts to elicit responses from LLMs that deviate from human values. To uncover vulnerabilities in LLM alignment methods, we propose the PASS framework (\underline{P}rompt J\underline{a}ilbreaking via \underline{S}emantic and \underline{S}tructural Formalization). Specifically, PASS employs reinforcement learning to transform initial jailbreak prompts into formalized descriptions, which enhances stealthiness and enables bypassing existing alignment defenses. The jailbreak outputs are then structured into a GraphRAG system that, by leveraging extracted relevant terms and formalized symbols as contextual input alongside the original query, strengthens subsequent attacks and facilitates more effective jailbreaks. We conducted extensive experiments on common open-source models, demonstrating the effectiveness of our attack.
Related papers
- xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking [32.89084809038529]
Black-box jailbreak is an attack where crafted prompts bypass safety mechanisms in large language models.<n>We propose a novel black-box jailbreak method leveraging reinforcement learning (RL)<n>We introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success.
arXiv Detail & Related papers (2025-01-28T06:07:58Z) - Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation [71.92055093709924]
We propose a novel method that "translates" garbled adversarial prompts into coherent and human-readable natural language adversarial prompts.<n>It also offers a new approach to discovering effective designs for jailbreak prompts, advancing the understanding of jailbreak attacks.<n>Our method achieves over 90% attack success rates against Llama-2-Chat models on AdvBench, despite their outstanding resistance to jailbreak attacks.
arXiv Detail & Related papers (2024-10-15T06:31:04Z) - RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process [23.66988994636578]
We introduce RePD, an innovative attack framework designed to mitigate the risk of jailbreak attacks on large language models (LLMs)<n>RePD operates on a one-shot learning model, wherein it accesses a database of jailbreak prompt templates to identify and decompose harmful inquiries embedded within user prompts.<n>We have demonstrated the efficacy of our proposed RePD in enhancing the resilience of LLMs against jailbreak attacks, without compromising their performance in responding to typical user requests.
arXiv Detail & Related papers (2024-10-11T09:39:11Z) - EnJa: Ensemble Jailbreak on Large Language Models [69.13666224876408]
Large Language Models (LLMs) are increasingly being deployed in safety-critical applications.
LLMs can still be jailbroken by carefully crafted malicious prompts, producing content that violates policy regulations.
We propose a novel EnJa attack to hide harmful instructions using prompt-level jailbreak, boost the attack success rate using a gradient-based attack, and connect the two types of jailbreak attacks via a template-based connector.
arXiv Detail & Related papers (2024-08-07T07:46:08Z) - Jailbreaking Large Language Models Through Alignment Vulnerabilities in Out-of-Distribution Settings [57.136748215262884]
We introduce ObscurePrompt for jailbreaking LLMs, inspired by the observed fragile alignments in Out-of-Distribution (OOD) data.<n>We first formulate the decision boundary in the jailbreaking process and then explore how obscure text affects LLM's ethical decision boundary.<n>Our approach substantially improves upon previous methods in terms of attack effectiveness, maintaining efficacy against two prevalent defense mechanisms.
arXiv Detail & Related papers (2024-06-19T16:09:58Z) - AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks.
Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z) - AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models [54.95912006700379]
We introduce AutoDAN, a novel jailbreak attack against aligned Large Language Models.
AutoDAN can automatically generate stealthy jailbreak prompts by the carefully designed hierarchical genetic algorithm.
arXiv Detail & Related papers (2023-10-03T19:44:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.