Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
- URL: http://arxiv.org/abs/2501.01830v1
- Date: Fri, 03 Jan 2025 14:30:14 GMT
- Title: Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
- Authors: Yanjiang Liu, Shuhen Zhou, Yaojie Lu, Huijia Zhu, Weiqiang Wang, Hongyu Lin, Ben He, Xianpei Han, Le Sun
- Abstract summary: Auto-RT is a reinforcement learning framework that automatically explores and optimizes complex attack strategies. By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection and 16.63% higher success rates compared to existing methods.
- Score: 62.12822290276912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
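To make the two mechanisms above concrete, the following is a minimal, hypothetical sketch of the exploration loop the abstract describes. The attacker/target/judge interfaces, the downgrade-model schedule, the threshold, and the update rule are all illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an Auto-RT-style exploration loop (illustrative only).
# Assumed interfaces: attacker.sample_strategy() -> strategy object,
# attacker.update(strategy, reward), strategy.render_probe()/render_attack(),
# model.respond(prompt) -> text, judge.score(text) -> float in [0, 1].

def auto_rt_sketch(attacker, target, downgraded_targets, judge,
                   n_iters=1000, early_stop_threshold=0.05):
    for _ in range(n_iters):
        # Attacker policy samples a candidate attack strategy (e.g. a prompt template).
        strategy = attacker.sample_strategy()

        # Early-terminated exploration: probe cheaply and abandon low-potential
        # strategies before spending the full rollout budget on them.
        probe_score = judge.score(target.respond(strategy.render_probe()))
        if probe_score < early_stop_threshold:
            attacker.update(strategy, reward=0.0)  # cheap negative signal
            continue

        # Progressive reward tracking: evaluate against weaker intermediate
        # "downgrade" models first, then the real target, so the reward signal
        # stays dense while the strategy is still immature.
        reward = 0.0
        for model in downgraded_targets + [target]:
            response = model.respond(strategy.render_attack())
            reward = judge.score(response)  # 1.0 would mean a successful jailbreak
            if reward < early_stop_threshold:
                break  # strategy is not yet ready for harder models

        # Policy-gradient-style update of the attacker on the tracked reward.
        attacker.update(strategy, reward)
```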
Related papers
- Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization [51.12422886183246]
Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks.
Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts.
We propose ACE-Safety, a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures.
arXiv Detail & Related papers (2025-11-24T15:23:41Z) - Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach [56.34189898996741]
We propose a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent.
IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks.
arXiv Detail & Related papers (2025-10-10T06:21:36Z) - AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling [54.47844626555395]
AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch.
While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt.
We propose to further improve the attack performance of AutoDAN-Turbo through test-time scaling.
arXiv Detail & Related papers (2025-10-06T21:16:09Z) - Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks [63.803415430308114]
Current large language models are vulnerable to adversarial attacks in multi-turn interaction settings.
We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search.
Our approach achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-02T17:57:05Z) - SAGE: Sample-Aware Guarding Engine for Robust Intrusion Detection Against Adversarial Attacks [12.537292017431641]
Machine learning-based intrusion detection systems (ML-IDS) remain highly susceptible to adversarial attacks.
We propose SAGE (Sample-Aware Guarding Engine), a substantially improved defense algorithm that integrates active learning with targeted data reduction.
SAGE demonstrates strong predictive performance across multiple intrusion detection datasets, achieving an average F1-score improvement of 201% over the state-of-the-art defenses.
arXiv Detail & Related papers (2025-09-09T18:57:27Z) - Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning [56.496001894673235]
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs).
Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling", and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy.
arXiv Detail & Related papers (2025-09-03T18:52:49Z) - AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [0.0]
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks.
We present AutoAdv, a novel framework that automates adversarial prompt generation.
We show that our attacks achieve jailbreak success rates of up to 86% for harmful content generation.
arXiv Detail & Related papers (2025-04-18T08:38:56Z) - Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions.
AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics.
Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z) - AED: Automatic Discovery of Effective and Diverse Vulnerabilities for Autonomous Driving Policy with Large Language Models [7.923448458349885]
We propose a framework that uses large language models (LLMs) to automatically discover effective and diverse vulnerabilities in autonomous driving policies.
Experiments show that AED achieves a broader range of vulnerabilities and higher attack success rates compared with expert-designed rewards.
arXiv Detail & Related papers (2025-03-24T14:59:17Z) - Robust Intrusion Detection System with Explainable Artificial Intelligence [0.0]
Adversarial input can exploit machine learning (ML) models through standard interfaces.
Conventional defenses such as adversarial training are costly in computational terms and often fail to provide real-time detection.
We suggest a novel strategy for detecting and mitigating adversarial attacks using eXplainable Artificial Intelligence (XAI).
arXiv Detail & Related papers (2025-03-07T10:31:59Z) - In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [97.82118821263825]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community.
We propose ICER, a novel red-teaming framework that generates interpretable and semantically meaningful problematic prompts.
Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z) - Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction [24.499874512829198]
We propose Holistic Automated Red teaMing, which scales up the diversity of test cases based on an adversarial, fine-grained risk taxonomy.
Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn probing in a human-like manner.
arXiv Detail & Related papers (2024-09-25T09:44:48Z) - PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation [18.432274815853116]
This paper introduces PenHeal, a two-stage LLM-based framework designed to autonomously identify and remediate security vulnerabilities.
arXiv Detail & Related papers (2024-07-25T05:42:14Z) - L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks [16.457528502745415]
This work introduces L-AutoDA, a novel approach leveraging the generative capabilities of Large Language Models (LLMs) to automate the design of adversarial attacks.
By iteratively interacting with LLMs in an evolutionary framework, L-AutoDA automatically designs competitive attack algorithms efficiently without much human effort.
We demonstrate the efficacy of L-AutoDA on the CIFAR-10 dataset, showing significant improvements over baseline methods in both success rate and computational efficiency.
arXiv Detail & Related papers (2024-01-27T07:57:20Z) - A Multi-objective Memetic Algorithm for Auto Adversarial Attack Optimization Design [1.9100854225243937]
Well-designed adversarial defense strategies can improve the robustness of deep learning models against adversarial examples.
Given a defended model, efficient adversarial attacks with less computational burden and lower robust accuracy still need to be further explored.
We propose a multi-objective memetic algorithm for auto adversarial attack optimization design, which realizes the automatic search for near-optimal adversarial attacks against defended models.
arXiv Detail & Related papers (2022-08-15T03:03:05Z) - Unified Automatic Control of Vehicular Systems with Reinforcement Learning [64.63619662693068]
This article contributes a streamlined methodology for vehicular microsimulation.
It discovers high performance control strategies with minimal manual design.
The study reveals numerous emergent behaviors resembling wave mitigation, traffic signaling, and ramp metering.
arXiv Detail & Related papers (2022-07-30T16:23:45Z) - LAS-AT: Adversarial Training with Learnable Attack Strategy [82.88724890186094]
"Learnable attack strategy", dubbed LAS-AT, learns to automatically produce attack strategies to improve the model robustness.
Our framework is composed of a target network that uses AEs for training to improve robustness and a strategy network that produces attack strategies to control the AE generation (a hedged sketch of this two-network loop is given after the related papers list).
arXiv Detail & Related papers (2022-03-13T10:21:26Z) - Behaviour-Diverse Automatic Penetration Testing: A Curiosity-Driven Multi-Objective Deep Reinforcement Learning Approach [3.5071575478443435]
Penetration testing plays a critical role in evaluating the security of a target network by emulating real active adversaries.
Deep Reinforcement Learning is seen as a promising solution to automating the process of penetration tests.
We propose a Chebyshev decomposition critic to find diverse adversary strategies that balance different objectives in the penetration test.
arXiv Detail & Related papers (2022-02-22T02:34:16Z) - Transferable Deep Reinforcement Learning Framework for Autonomous Vehicles with Joint Radar-Data Communications [69.24726496448713]
We propose an intelligent optimization framework based on the Markov Decision Process (MDP) to help the AV make optimal decisions.
We then develop an effective learning algorithm leveraging recent advances in deep reinforcement learning techniques to find the optimal policy for the AV.
We show that the proposed transferable deep reinforcement learning framework reduces the obstacle miss detection probability by the AV up to 67% compared to other conventional deep reinforcement learning approaches.
arXiv Detail & Related papers (2021-05-28T08:45:37Z) - Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement Learning [91.13113161754022]
We introduce timing-based adversarial strategies against a DRL-based navigation system by jamming in physical noise patterns on the selected time frames.
Our experimental results show that the adversarial timing attacks can lead to a significant performance drop.
arXiv Detail & Related papers (2020-02-20T21:39:25Z)
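As referenced in the LAS-AT entry above, the following is a hedged sketch of the two-network loop that summary describes: a strategy network proposes attack parameters controlling adversarial example (AE) generation, and a target network trains on the resulting AEs. The class names, the REINFORCE-style strategy update, and the placeholder attack_fn are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def las_at_style_step(target_net, strategy_net, attack_fn,
                      x, y, target_opt, strategy_opt):
    """One hypothetical training step in the spirit of LAS-AT (sketch only)."""
    # 1. Strategy network proposes a discrete attack strategy per sample
    #    (e.g. an index into a table of (epsilon, step size, #steps) settings).
    dist = torch.distributions.Categorical(logits=strategy_net(x))
    strategy = dist.sample()

    # 2. Generate adversarial examples under the sampled strategy.
    #    attack_fn is a placeholder for a PGD-like attack parameterized by `strategy`.
    x_adv = attack_fn(target_net, x, y, strategy).detach()

    # 3. Target network: standard adversarial training on the generated AEs.
    target_loss = F.cross_entropy(target_net(x_adv), y)
    target_opt.zero_grad()
    target_loss.backward()
    target_opt.step()

    # 4. Strategy network: REINFORCE-style update that rewards strategies
    #    producing harder AEs (higher per-sample loss on the updated target).
    with torch.no_grad():
        reward = F.cross_entropy(target_net(x_adv), y, reduction="none")
    log_prob = dist.log_prob(strategy)
    strategy_loss = -(reward * log_prob).mean()
    strategy_opt.zero_grad()
    strategy_loss.backward()
    strategy_opt.step()
    return target_loss.item(), strategy_loss.item()
```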