Related papers: Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction

URL: http://arxiv.org/abs/2409.16783v1
Date: Wed, 25 Sep 2024 09:44:48 GMT
Title: Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
Authors: Jinchuan Zhang, Yan Zhou, Yaxin Liu, Ziming Li, Songlin Hu
Abstract summary: We propose Holistic Automated Red teaMing (HARM), which scales up the diversity of test cases based on an extensible, fine-grained risk taxonomy. Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn adversarial probing in a human-like manner.
Score: 24.499874512829198
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated red teaming is an effective method for identifying misaligned behaviors in large language models (LLMs). Existing approaches, however, often focus primarily on improving attack success rates while overlooking the need for comprehensive test case coverage. Additionally, most of these methods are limited to single-turn red teaming, failing to capture the multi-turn dynamics of real-world human-machine interactions. To overcome these limitations, we propose HARM (Holistic Automated Red teaMing), which scales up the diversity of test cases using a top-down approach based on an extensible, fine-grained risk taxonomy. Our method also leverages a novel fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn adversarial probing in a human-like manner. Experimental results demonstrate that our framework enables a more systematic understanding of model vulnerabilities and offers more targeted guidance for the alignment process.

Related papers

Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks [63.803415430308114]
Current large language models are vulnerable to adversarial attacks in multi-turn interaction settings. We propose DialTree-RPO, an on-policy reinforcement learning framework integrated with tree search. Our approach achieves more than 25.9% higher ASR across 10 target models compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2025-10-02T17:57:05Z)
Automatic LLM Red Teaming [18.044879441434432]
We propose a novel paradigm: training an AI to strategically 'break' another AI. Our generative agent learns coherent, multi-turn attack strategies through a fine-grained, token-level harm reward. This approach sets a new state-of-the-art, fundamentally reframing red teaming as a dynamic, trajectory-based process.
arXiv Detail & Related papers (2025-08-06T13:52:00Z)
Towards Robust Dialogue Breakdown Detection: Addressing Disruptors in Large Language Models with Self-Guided Reasoning [30.13634341221476]
Large language models (LLMs) are rapidly changing various domains. This paper addresses the challenge of detecting and mitigating dialogue breakdowns within LLM-driven systems. We propose an approach that combines specialized fine-tuning with advanced prompting strategies.
arXiv Detail & Related papers (2025-04-26T07:51:05Z)
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning [39.931442440365444]
AlgName is a novel red-teaming agent that emulates sophisticated human attackers through complementary learning dimensions. AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic selection framework, and refine prompt formulations for selected tactics. Empirical evaluations on JailbreakBench demonstrate our framework's superior performance, achieving over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within 5 conversation turns.
arXiv Detail & Related papers (2025-04-02T01:06:19Z)
Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models [62.12822290276912]
Auto-RT is a reinforcement learning framework that automatically explores and optimizes complex attack strategies. By significantly improving exploration efficiency and automatically optimizing attack strategies, Auto-RT detects a broader range of vulnerabilities, achieving a faster detection speed and 16.63% higher success rates compared to existing methods.
arXiv Detail & Related papers (2025-01-03T14:30:14Z)
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester [8.947465706080523]
Red teaming assesses how large language models can produce content that violates norms, policies, and rules set during their safety training. Most existing automated methods in the literature are not representative of the way humans tend to interact with AI models. We introduce Generative Offensive Agent Tester (GOAT), an automated agentic red teaming system that simulates plain language adversarial conversations.
arXiv Detail & Related papers (2024-10-02T14:47:05Z)
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints [68.82294911302579]
We introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better-enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization.
arXiv Detail & Related papers (2024-05-29T12:12:09Z)
Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
Multi-granular Adversarial Attacks against Black-box Neural Ranking Models [111.58315434849047]
We create high-quality adversarial examples by incorporating multi-granular perturbations. We transform the multi-granular attack into a sequential decision-making process. Our attack method surpasses prevailing baselines in both attack effectiveness and imperceptibility.
arXiv Detail & Related papers (2024-04-02T02:08:29Z)
RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar but potentially even more practical than those of interactive imitation learning. Our proposed method uses reinforcement learning with user intervention signals themselves as rewards. This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over the potentially suboptimal human expert.
arXiv Detail & Related papers (2023-11-21T21:05:21Z)
Behaviour-Diverse Automatic Penetration Testing: A Curiosity-Driven Multi-Objective Deep Reinforcement Learning Approach [3.5071575478443435]
Penetration testing plays a critical role in evaluating the security of a target network by emulating real active adversaries. Deep Reinforcement Learning is seen as a promising solution to automating the process of penetration tests. We propose a Chebyshev decomposition critic to find diverse adversary strategies that balance different objectives in the penetration test.
arXiv Detail & Related papers (2022-02-22T02:34:16Z)
Increasing the Confidence of Deep Neural Networks by Coverage Analysis [71.57324258813674]
This paper presents a lightweight monitoring architecture based on coverage paradigms to enhance the model against different unsafe inputs. Experimental results show that the proposed approach is effective in detecting both powerful adversarial examples and out-of-distribution inputs.
arXiv Detail & Related papers (2021-01-28T16:38:26Z)
A Hamiltonian Monte Carlo Method for Probabilistic Adversarial Attack and Learning [122.49765136434353]
We present an effective method, called Hamiltonian Monte Carlo with Accumulated Momentum (HMCAM), aiming to generate a sequence of adversarial examples. We also propose a new generative method called Contrastive Adversarial Training (CAT), which approaches the equilibrium distribution of adversarial examples. Both quantitative and qualitative analyses on several natural image datasets and practical systems have confirmed the superiority of the proposed algorithm.
arXiv Detail & Related papers (2020-10-15T16:07:26Z)
Opportunities and Challenges in Deep Learning Adversarial Robustness: A Survey [1.8782750537161614]
This paper studies strategies to implement adversary robustly trained algorithms towards guaranteeing safety in machine learning algorithms. We provide a taxonomy to classify adversarial attacks and defenses, formulate the Robust Optimization problem in a min-max setting, and divide it into 3 subcategories, namely: Adversarial (re)Training, Regularization Approach, and Certified Defenses.
arXiv Detail & Related papers (2020-07-01T21:00:32Z)
Adversarial Distributional Training for Robust Deep Learning [53.300984501078126]
Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples. Most existing AT methods adopt a specific attack to craft adversarial examples, leading to unreliable robustness against other unseen attacks. In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models.
arXiv Detail & Related papers (2020-02-14T12:36:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.