Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
- URL: http://arxiv.org/abs/2501.01830v1
- Date: Fri, 03 Jan 2025 14:30:14 GMT
- Title: Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
- Authors: Yanjiang Liu, Shuhen Zhou, Yaojie Lu, Huijia Zhu, Weiqiang Wang, Hongyu Lin, Ben He, Xianpei Han, Le Sun
- Abstract summary: Auto-RT is a reinforcement learning framework that automatically explores and optimizes complex attack strategies. By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving faster detection and 16.63% higher success rates compared to existing methods.
- Score: 62.12822290276912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated red-teaming has become a crucial approach for uncovering vulnerabilities in large language models (LLMs). However, most existing methods focus on isolated safety flaws, limiting their ability to adapt to dynamic defenses and uncover complex vulnerabilities efficiently. To address this challenge, we propose Auto-RT, a reinforcement learning framework that automatically explores and optimizes complex attack strategies to effectively uncover security vulnerabilities through malicious queries. Specifically, we introduce two key mechanisms to reduce exploration complexity and improve strategy optimization: 1) Early-terminated Exploration, which accelerates exploration by focusing on high-potential attack strategies; and 2) a Progressive Reward Tracking algorithm with intermediate downgrade models, which dynamically refines the search trajectory toward successful vulnerability exploitation. Extensive experiments across diverse LLMs demonstrate that, by significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
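To make the two mechanisms above concrete, the following is a minimal, hypothetical sketch of the exploration loop the abstract describes. The attacker/target/judge interfaces, the downgrade-model schedule, the threshold, and the update rule are all illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of an Auto-RT-style exploration loop (illustrative only).
# Assumed interfaces: attacker.sample_strategy() -> strategy object,
# attacker.update(strategy, reward), strategy.render_probe()/render_attack(),
# model.respond(prompt) -> text, judge.score(text) -> float in [0, 1].

def auto_rt_sketch(attacker, target, downgraded_targets, judge,
                   n_iters=1000, early_stop_threshold=0.05):
    for _ in range(n_iters):
        # Attacker policy samples a candidate attack strategy (e.g. a prompt template).
        strategy = attacker.sample_strategy()

        # Early-terminated exploration: probe cheaply and abandon low-potential
        # strategies before spending the full rollout budget on them.
        probe_score = judge.score(target.respond(strategy.render_probe()))
        if probe_score < early_stop_threshold:
            attacker.update(strategy, reward=0.0)  # cheap negative signal
            continue

        # Progressive reward tracking: evaluate against weaker intermediate
        # "downgrade" models first, then the real target, so the reward signal
        # stays dense while the strategy is still immature.
        reward = 0.0
        for model in downgraded_targets + [target]:
            response = model.respond(strategy.render_attack())
            reward = judge.score(response)  # 1.0 would mean a successful jailbreak
            if reward < early_stop_threshold:
                break  # strategy is not yet ready for harder models

        # Policy-gradient-style update of the attacker on the tracked reward.
        attacker.update(strategy, reward)
```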
Related papers
- Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization [51.12422886183246]
Large Language Models (LLMs) have developed rapidly in web services, delivering unprecedented capabilities while amplifying societal risks.
Existing works tend to focus on either isolated jailbreak attacks or static defenses, neglecting the dynamic interplay between evolving threats and safeguards in real-world web contexts.
We propose ACE-Safety, a novel framework that jointly optimizes attack and defense models by seamlessly integrating two key innovative procedures.
arXiv Detail & Related papers (2025-11-24T15:23:41Z) - Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach [56.34189898996741]
We propose a novel robust autonomous driving approach that consists of a strategic targeted adversary and a robust driving agent.
IGCARL improves the success rate by at least 27.9% over state-of-the-art methods, demonstrating superior robustness to adversarial attacks.
arXiv Detail & Related papers (2025-10-10T06:21:36Z) - AutoDAN-Reasoning: Enhancing Strategies Exploration based Jailbreak Attacks with Test-Time Scaling [54.47844626555395]
AutoDAN-Turbo employs a lifelong learning agent to build a rich library of attack strategies from scratch.
While highly effective, its test-time generation process involves sampling a strategy and generating a single corresponding attack prompt.
We propose to further improve the attack performance of AutoDAN-Turbo through test-time scaling.
arXiv Detail & Related papers (2025-10-06T21:16:09Z) - Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks [63.803415430308114]
Current large language models are vulnerable to adversarial attacks in multi-turn interaction settings.
We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search.
Our approach achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-02T17:57:05Z) - SAGE: Sample-Aware Guarding Engine for Robust Intrusion Detection Against Adversarial Attacks [12.537292017431641]
Machine learning-based intrusion detection systems (ML-IDS) remain highly susceptible to adversarial attacks.
We propose SAGE (Sample-Aware Guarding Engine), a substantially improved defense algorithm that integrates active learning with targeted data reduction.
SAGE demonstrates strong predictive performance across multiple intrusion detection datasets, achieving an average F1-score improvement of 201% over the state-of-the-art defenses.
arXiv Detail & Related papers (2025-09-09T18:57:27Z) - Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning [56.496001894673235]
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs).
Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling", and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy.
arXiv Detail & Related papers (2025-09-03T18:52:49Z) - AutoAdv: Automated Adversarial Prompting for Multi-Turn Jailbreaking of Large Language Models [0.0]
Large Language Models (LLMs) continue to exhibit vulnerabilities to jailbreaking attacks.
We present AutoAdv, a novel framework that automates adversarial prompt generation.
We show that our attacks achieve jailbreak success rates of up to 86% for harmful content generation.
arXiv Detail & Related papers (2025-04-18T08:38:56Z) - Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions.
AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics.
Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z) - AED: Automatic Discovery of Effective and Diverse Vulnerabilities for Autonomous Driving Policy with Large Language Models [7.923448458349885]
We propose a framework that uses large language models (LLMs) to automatically discover effective and diverse vulnerabilities in autonomous driving policies.
Experiments show that AED achieves a broader range of vulnerabilities and higher attack success rates compared with expert-designed rewards.
arXiv Detail & Related papers (2025-03-24T14:59:17Z) - Robust Intrusion Detection System with Explainable Artificial Intelligence [0.0]
Adversarial input can exploit machine learning (ML) models through standard interfaces.
Conventional defenses such as adversarial training are costly in computational terms and often fail to provide real-time detection.
We suggest a novel strategy for detecting and mitigating adversarial attacks using eXplainable Artificial Intelligence (XAI).
arXiv Detail & Related papers (2025-03-07T10:31:59Z) - In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [97.82118821263825]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community.
We propose ICER, a novel red-teaming framework that generates interpretable and semantically meaningful problematic prompts.
Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z) - Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction [24.499874512829198]
We propose Holistic Automated Red teaMing, which scales up the diversity of test cases based on an adversarial, fine-grained risk taxonomy.
Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn probing in a human-like manner.
arXiv Detail & Related papers (2024-09-25T09:44:48Z) - PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation [18.432274815853116]
This paper introduces PenHeal, a two-stage LLM-based framework designed to autonomously identify and remediate security vulnerabilities.
arXiv Detail & Related papers (2024-07-25T05:42:14Z) - L-AutoDA: Leveraging Large Language Models for Automated Decision-based Adversarial Attacks [16.457528502745415]
This work introduces L-AutoDA, a novel approach leveraging the generative capabilities of Large Language Models (LLMs) to automate the design of adversarial attacks.
By iteratively interacting with LLMs in an evolutionary framework, L-AutoDA automatically designs competitive attack algorithms efficiently without much human effort.
We demonstrate the efficacy of L-AutoDA on the CIFAR-10 dataset, showing significant improvements over baseline methods in both success rate and computational efficiency.
arXiv Detail & Related papers (2024-01-27T07:57:20Z) - A Multi-objective Memetic Algorithm for Auto Adversarial Attack Optimization Design [1.9100854225243937]
Well-designed adversarial defense strategies can improve the robustness of deep learning models against adversarial examples.
Given a defended model, efficient adversarial attacks with less computational burden and lower robust accuracy still need to be further explored.
We propose a multi-objective memetic algorithm for auto adversarial attack optimization design, which realizes the automatic search for near-optimal adversarial attacks against defended models.
arXiv Detail & Related papers (2022-08-15T03:03:05Z) - Unified Automatic Control of Vehicular Systems with Reinforcement Learning [64.63619662693068]
This article contributes a streamlined methodology for vehicular microsimulation.
It discovers high performance control strategies with minimal manual design.
The study reveals numerous emergent behaviors resembling wave mitigation, traffic signaling, and ramp metering.
arXiv Detail & Related papers (2022-07-30T16:23:45Z) - LAS-AT: Adversarial Training with Learnable Attack Strategy [82.88724890186094]
"Learnable attack strategy", dubbed LAS-AT, learns to automatically produce attack strategies to improve the model robustness.
Our framework is composed of a target network that uses AEs for training to improve robustness and a strategy network that produces attack strategies to control the AE generation (a hedged sketch of this two-network loop is given after the related papers list).
arXiv Detail & Related papers (2022-03-13T10:21:26Z) - Behaviour-Diverse Automatic Penetration Testing: A Curiosity-Driven Multi-Objective Deep Reinforcement Learning Approach [3.5071575478443435]
Penetration testing plays a critical role in evaluating the security of a target network by emulating real active adversaries.
Deep Reinforcement Learning is seen as a promising solution to automating the process of penetration tests.
We propose a Chebyshev decomposition critic to find diverse adversary strategies that balance different objectives in the penetration test.
arXiv Detail & Related papers (2022-02-22T02:34:16Z) - Transferable Deep Reinforcement Learning Framework for Autonomous Vehicles with Joint Radar-Data Communications [69.24726496448713]
We propose an intelligent optimization framework based on the Markov Decision Process (MDP) to help the AV make optimal decisions.
We then develop an effective learning algorithm leveraging recent advances in deep reinforcement learning techniques to find the optimal policy for the AV.
We show that the proposed transferable deep reinforcement learning framework reduces the obstacle miss detection probability by the AV up to 67% compared to other conventional deep reinforcement learning approaches.
arXiv Detail & Related papers (2021-05-28T08:45:37Z) - Enhanced Adversarial Strategically-Timed Attacks against Deep Reinforcement Learning [91.13113161754022]
We introduce timing-based adversarial strategies against a DRL-based navigation system by jamming in physical noise patterns on the selected time frames.
Our experimental results show that the adversarial timing attacks can lead to a significant performance drop.
arXiv Detail & Related papers (2020-02-20T21:39:25Z)
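As referenced in the LAS-AT entry above, the following is a hedged sketch of the two-network loop that summary describes: a strategy network proposes attack parameters controlling adversarial example (AE) generation, and a target network trains on the resulting AEs. The class names, the REINFORCE-style strategy update, and the placeholder attack_fn are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def las_at_style_step(target_net, strategy_net, attack_fn,
                      x, y, target_opt, strategy_opt):
    """One hypothetical training step in the spirit of LAS-AT (sketch only)."""
    # 1. Strategy network proposes a discrete attack strategy per sample
    #    (e.g. an index into a table of (epsilon, step size, #steps) settings).
    dist = torch.distributions.Categorical(logits=strategy_net(x))
    strategy = dist.sample()

    # 2. Generate adversarial examples under the sampled strategy.
    #    attack_fn is a placeholder for a PGD-like attack parameterized by `strategy`.
    x_adv = attack_fn(target_net, x, y, strategy).detach()

    # 3. Target network: standard adversarial training on the generated AEs.
    target_loss = F.cross_entropy(target_net(x_adv), y)
    target_opt.zero_grad()
    target_loss.backward()
    target_opt.step()

    # 4. Strategy network: REINFORCE-style update that rewards strategies
    #    producing harder AEs (higher per-sample loss on the updated target).
    with torch.no_grad():
        reward = F.cross_entropy(target_net(x_adv), y, reduction="none")
    log_prob = dist.log_prob(strategy)
    strategy_loss = -(reward * log_prob).mean()
    strategy_opt.zero_grad()
    strategy_loss.backward()
    strategy_opt.step()
    return target_loss.item(), strategy_loss.item()
```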