Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
- URL: http://arxiv.org/abs/2511.19218v2
- Date: Wed, 26 Nov 2025 15:12:18 GMT
- Title: Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
- Authors: Xurui Li, Kaisong Song, Rui Zhu, Pin-Yu Chen, Haixu Tang,
- Abstract summary: Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. We propose ACE-Safety, a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures.
- Score: 51.12422886183246
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks. Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts. To mitigate these challenges, we propose ACE-Safety (Adversarial Co-Evolution for LLM Safety), a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures: (1) Group-aware Strategy-guided Monte Carlo Tree Search (GS-MCTS), which efficiently explores jailbreak strategies to uncover vulnerabilities and generate diverse adversarial samples; (2) Adversarial Curriculum Tree-aware Group Policy Optimization (AC-TGPO), which jointly trains attack and defense LLMs with challenging samples via curriculum reinforcement learning, enabling robust mutual improvement. Evaluations across multiple benchmarks demonstrate that our method outperforms existing attack and defense approaches, and provides a feasible pathway for developing LLMs that can sustainably support responsible AI ecosystems.
Related papers
- Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework [9.460661802313835]
CyberOps-Bots is a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots maintains network availability 68.5% higher. We will release our framework to the community, facilitating the advancement of robust and autonomous defense in cloud networks.
arXiv Detail & Related papers (2026-01-12T01:25:41Z)
- L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint) [16.291320202524187]
L2M-AID is a novel framework for Autonomous Industrial Defense using Multi-agent reinforcement learning. It orchestrates a team of collaborative agents, each driven by a Large Language Model (LLM), to achieve adaptive and resilient security. Results demonstrate that L2M-AID significantly outperforms traditional IDS, deep learning anomaly detectors, and single-agent RL baselines.
arXiv Detail & Related papers (2025-10-08T17:46:39Z)
- DOPA: Stealthy and Generalizable Backdoor Attacks from a Single Client under Challenging Federated Constraints [2.139012072214621]
Federated Learning (FL) is increasingly adopted for privacy-preserving collaborative training, but its decentralized nature makes it susceptible to backdoor attacks. Existing attack methods, however, often rely on idealized assumptions and fail to remain effective under real-world constraints. We propose DOPA, a novel framework that simulates heterogeneous local training dynamics and seeks consensus across divergent optimization trajectories to craft universally effective and stealthy backdoor triggers.
arXiv Detail & Related papers (2025-08-20T08:39:12Z)
- Searching for Privacy Risks in LLM Agents via Simulation [61.229785851581504]
We present a search-based framework that alternates between improving attack and defense strategies through the simulation of privacy-critical agent interactions. We find that attack strategies escalate from direct requests to sophisticated tactics, such as impersonation and consent forgery. The discovered attacks and defenses transfer across diverse scenarios and backbone models, demonstrating strong practical utility for building privacy-aware agents.
arXiv Detail & Related papers (2025-08-14T17:49:09Z)
- ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks [61.06621533874629]
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs). In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio.
arXiv Detail & Related papers (2025-07-02T03:09:20Z)
- An LLM-based Self-Evolving Security Framework for 6G Space-Air-Ground Integrated Networks [49.605335601285496]
6G space-air-ground integrated networks (SAGINs) offer ubiquitous coverage for various mobile applications. We propose a novel security framework for SAGINs based on Large Language Models (LLMs). Our framework produces highly accurate security strategies that remain robust against a variety of unknown attacks.
arXiv Detail & Related papers (2025-05-06T04:14:13Z)
- Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimizes complex attack strategies. By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
arXiv Detail & Related papers (2025-01-03T14:30:14Z)
- LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models [59.29840790102413]
Existing jailbreak attacks are primarily based on opaque optimization techniques and gradient search methods. We propose LLM-Virus, a jailbreak attack method based on an evolutionary algorithm, termed evolutionary jailbreak. Our results show that LLM-Virus achieves competitive or even superior performance compared to existing attack methods.
arXiv Detail & Related papers (2024-12-28T07:48:57Z)
- Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs [13.317364896194903]
We propose a two-stage adversarial tuning framework to enhance Large Language Models' generalized defense capabilities. In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently generate token-level adversarial prompts. In the second stage, we propose the automatic adversarial prompt learning to iteratively refine semantic-level adversarial prompts.
arXiv Detail & Related papers (2024-06-07T15:37:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.