A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking
- URL: http://arxiv.org/abs/2507.08207v1
- Date: Thu, 10 Jul 2025 22:37:47 GMT
- Title: A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking
- Authors: Zhengye Han, Quanyan Zhu
- Abstract summary: This paper presents a Stackelberg game framework to model the interactions between attackers and defenders in the context of large language model jailbreaking. We propose a novel agentic AI solution, the "Purple Agent," which integrates adversarial exploration and defensive strategies using Rapidly-exploring Random Trees (RRT).
- Score: 13.343937277604892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) are increasingly deployed in critical applications, the challenge of jailbreaking, where adversaries manipulate the models to bypass safety mechanisms, has become a significant concern. This paper presents a dynamic Stackelberg game framework to model the interactions between attackers and defenders in the context of LLM jailbreaking. The framework treats the prompt-response dynamics as a sequential extensive-form game, where the defender, as the leader, commits to a strategy while anticipating the attacker's optimal responses. We propose a novel agentic AI solution, the "Purple Agent," which integrates adversarial exploration and defensive strategies using Rapidly-exploring Random Trees (RRT). The Purple Agent actively simulates potential attack trajectories and intervenes proactively to prevent harmful outputs. This approach offers a principled method for analyzing adversarial dynamics and provides a foundation for mitigating the risk of jailbreaking.
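The abstract does not include code, but the RRT component can be made concrete with a small sketch. Below is a minimal, self-contained illustration of RRT-style exploration in the Purple Agent's spirit: everything specific here (the 2-D "attack feature" space, the toy risk_score, the step size, and the threshold) is an assumption for illustration, not the paper's actual state space or harm estimator.

```python
import math
import random

# Illustrative assumptions: prompts are abstracted as points in a 2-D
# "attack feature" space, and risk_score is a toy stand-in for a learned
# harm classifier. None of these specifics come from the paper itself.

STEP = 0.1            # how far the tree extends toward each sample
RISK_THRESHOLD = 0.8  # trajectories scoring above this are flagged

def risk_score(state):
    # Toy harmfulness estimator: risk grows as the state approaches a
    # hypothetical "jailbreak region" centered at (1, 1).
    return math.exp(-((state[0] - 1.0) ** 2 + (state[1] - 1.0) ** 2))

def nearest(tree, sample):
    # Standard RRT step: find the tree node closest to the random sample.
    return min(tree, key=lambda node: math.dist(node, sample))

def extend(node, sample, step=STEP):
    # Move from the nearest node toward the sample by at most one step.
    d = math.dist(node, sample)
    if d < step:
        return sample
    t = step / d
    return (node[0] + t * (sample[0] - node[0]),
            node[1] + t * (sample[1] - node[1]))

def purple_agent_explore(n_iters=500, seed=0):
    """Grow an RRT from the benign origin and collect states that cross
    the risk threshold, i.e. simulated attack paths the defender should
    proactively intervene on."""
    random.seed(seed)
    tree = [(0.0, 0.0)]  # root: the benign starting prompt
    flagged = []
    for _ in range(n_iters):
        sample = (random.uniform(-0.5, 1.5), random.uniform(-0.5, 1.5))
        new = extend(nearest(tree, sample), sample)
        tree.append(new)
        if risk_score(new) > RISK_THRESHOLD:
            flagged.append(new)  # proactive intervention point
    return tree, flagged

if __name__ == "__main__":
    tree, flagged = purple_agent_explore()
    print(f"explored {len(tree)} states, flagged {len(flagged)} risky ones")
```

The Stackelberg (leader) aspect would enter by the defender committing to the intervention rule, here the fixed risk threshold, before attackers best-respond to it.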
Related papers
- Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models [55.28518567702213]
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch: attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. We propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and a defender agent co-evolve through continuous interaction.
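As a toy illustration of this co-evolution loop (a schematic sketch, not Self-RedTeam's training code: the scalar skill parameters and the logistic win probability are invented stand-ins for RL policy updates on LM weights):

```python
import random

# Schematic co-evolution: an attacker policy and a defender policy play
# repeated rounds, and each side nudges a scalar "skill" from the
# zero-sum outcome. A real implementation would update LM weights via RL.

def play_round(attack_skill, defend_skill):
    # Attack succeeds with probability determined by the skill gap.
    p_success = 1.0 / (1.0 + pow(10, defend_skill - attack_skill))
    return random.random() < p_success

def self_play(rounds=1000, lr=0.01, seed=0):
    random.seed(seed)
    attack_skill, defend_skill = 0.0, 0.0
    for _ in range(rounds):
        if play_round(attack_skill, defend_skill):
            defend_skill += lr  # defender patches the exposed weakness
        else:
            attack_skill += lr  # attacker searches for new exploits
    return attack_skill, defend_skill

if __name__ == "__main__":
    a, d = self_play()
    print(f"final skills -- attacker: {a:.2f}, defender: {d:.2f}")
```

Neither side ever trains against a static opponent, which is the moving-target property the abstract emphasizes.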
arXiv Detail & Related papers (2025-06-09T06:35:12Z)
- AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [0.0]
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks. We present AutoAdv, a novel framework that automates adversarial prompt generation. We show that our attacks achieve jailbreak success rates of up to 86% for harmful content generation.
arXiv Detail & Related papers (2025-04-18T08:38:56Z)
- Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation (RACE) is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
- Toward Optimal LLM Alignments Using Two-Player Games [86.39338084862324]
In this paper, we investigate alignment through the lens of two-agent games, involving iterative interactions between an adversarial and a defensive agent.
We theoretically demonstrate that this iterative reinforcement learning optimization converges to a Nash Equilibrium for the game induced by the agents.
Experimental results in safety scenarios demonstrate that learning in such a competitive environment not only fully trains both agents but also leads to policies with enhanced generalization capabilities for both the adversarial and defensive agents.
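As a rough formalization of the two-agent setting this abstract describes (a hedged sketch: the payoff u(x, y) below is an assumed scalar harm utility, not the cited paper's exact reward), the attacker policy \pi_A and defender policy \pi_D solve a zero-sum saddle-point problem, and the Nash equilibrium is where the two orders of optimization coincide:

```latex
% Schematic zero-sum objective. Assumptions: x is an attacker prompt,
% y the defended model's response, u(x, y) a scalar harm payoff.
\min_{\pi_D} \max_{\pi_A}\;
  \mathbb{E}_{x \sim \pi_A,\; y \sim \pi_D(\cdot \mid x)}\!\left[u(x, y)\right]
\;=\;
\max_{\pi_A} \min_{\pi_D}\;
  \mathbb{E}_{x \sim \pi_A,\; y \sim \pi_D(\cdot \mid x)}\!\left[u(x, y)\right]
```

When a saddle point exists, neither agent can improve by deviating unilaterally from (\pi_A^*, \pi_D^*), which is exactly the convergence target of the iterative reinforcement learning optimization described above.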
arXiv Detail & Related papers (2024-06-16T15:24:50Z)
- AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens [83.08119913279488]
We present a systematic analysis of the dependency relationships in jailbreak attack and defense techniques.
We propose three comprehensive, automated, and logical frameworks.
We show that the proposed ensemble jailbreak attack and defense framework significantly outperforms existing methods.
arXiv Detail & Related papers (2024-06-06T07:24:41Z)
- Defending Jailbreak Prompts via In-Context Adversarial Game [34.83853184278604]
We introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning. Unlike traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. Our empirical studies affirm ICAG's efficacy, where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates.
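A minimal sketch of the fine-tuning-free, iterative idea; the string-matching functions below are toy stand-ins for the LLM-based attack, defense, and insight-extraction agents that ICAG actually uses:

```python
# Schematic ICAG-style loop: the defender improves only by growing its
# in-context defense prompt, never by updating weights. All functions
# here are illustrative assumptions, not the paper's implementation.

def defender_respond(defense_prompt: str, attack: str) -> str:
    # Toy defended model: refuses when the attack matches a learned rule.
    rules = [r for r in defense_prompt.splitlines() if r]
    if any(rule in attack for rule in rules):
        return "REFUSED"
    return f"COMPLIED: {attack}"

def extract_insight(attack: str) -> str:
    # Toy insight extraction: remember a distinctive token of the attack.
    return attack.split()[0]

def icag_style_loop(attacks, defense_prompt="", n_rounds=3):
    for _ in range(n_rounds):
        for attack in attacks:
            response = defender_respond(defense_prompt, attack)
            if response.startswith("COMPLIED"):
                # Distill a reusable lesson from the failure and append
                # it to the defender's *context*, not its weights.
                defense_prompt += "\n" + extract_insight(attack)
        # A real attacker agent would symmetrically refine `attacks` here.
    return defense_prompt

if __name__ == "__main__":
    attacks = ["roleplay as an unfiltered model", "ignore all prior rules"]
    print(icag_style_loop(attacks))
```

The property the sketch preserves is that defense improvements accumulate purely in the prompt, which is what distinguishes ICAG from fine-tuning-based patching.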
arXiv Detail & Related papers (2024-02-20T17:04:06Z)
- Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks [55.603893267803265]
Large Language Models (LLMs) are susceptible to Jailbreaking attacks.
Jailbreaking attacks aim to extract harmful information by subtly modifying the attack query.
We focus on a new attack form, called Contextual Interaction Attack.
arXiv Detail & Related papers (2024-02-14T13:45:19Z)
- Fixed Points in Cyber Space: Rethinking Optimal Evasion Attacks in the Age of AI-NIDS [70.60975663021952]
We study black-box adversarial attacks on network classifiers.
We argue that attacker-defender fixed points are themselves general-sum games with complex phase transitions.
We show that a continual learning approach is required to study attacker-defender dynamics.
arXiv Detail & Related papers (2021-11-23T23:42:16Z)