Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
- URL: http://arxiv.org/abs/2408.10701v1
- Date: Tue, 20 Aug 2024 09:58:01 GMT
- Title: Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
- Authors: Tej Deep Pala, Vernon Y. H. Toh, Rishabh Bhardwaj, Soujanya Poria
- Abstract summary: Ferret is a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration.
Ferret improves the overall attack success rate (ASR) to 95%, which is 46% higher than Rainbow Teaming.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In today's era, where large language models (LLMs) are integrated into numerous real-world applications, ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role in this process by generating adversarial attacks to identify and mitigate potential vulnerabilities in these models. However, existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While Rainbow Teaming, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance. To overcome these limitations, we propose Ferret, a novel approach that builds upon Rainbow Teaming by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt. We explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge, to rank adversarial mutations based on their potential harm and thereby improve the efficiency of the search for harmful mutations. Our results demonstrate that Ferret, utilizing a reward model as a scoring function, improves the overall attack success rate (ASR) to 95%, which is 46% higher than Rainbow Teaming. Additionally, Ferret reduces the time needed to achieve a 90% ASR by 15.2% compared to the baseline and generates adversarial prompts that are transferable, i.e., effective on other LLMs of larger size. Our code is available at https://github.com/declare-lab/ferret.
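The abstract describes a loop that generates several mutations per iteration, ranks them with a scoring function, and keeps the most effective one. The following is a minimal sketch of that generate-rank-select step only, not the authors' implementation. The names `select_best`, `ferret_style_step`, the dictionary archive keyed by category, and the "replace only if the score improves" rule are all assumptions made for illustration; the mutator and scorer are passed in as plain callables and stand in for the LLM mutator and the reward model mentioned in the abstract.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical archive layout: one (prompt, score) entry per category.
Archive = Dict[str, Tuple[str, float]]


def select_best(candidates: List[str],
                score: Callable[[str], float]) -> Tuple[str, float]:
    """Score every candidate and return the highest-scoring (candidate, score) pair."""
    scored = [(candidate, score(candidate)) for candidate in candidates]
    return max(scored, key=lambda pair: pair[1])


def ferret_style_step(archive: Archive,
                      category: str,
                      mutate: Callable[[str], str],
                      score: Callable[[str], float],
                      num_mutations: int = 5) -> Tuple[str, float]:
    """Run one iteration: mutate the stored prompt several times, rank, and select.

    Assumption: the archive entry is replaced only when the best mutation
    scores strictly higher than the current entry.
    """
    parent, parent_score = archive[category]
    candidates = [mutate(parent) for _ in range(num_mutations)]
    best, best_score = select_best(candidates, score)
    if best_score > parent_score:
        archive[category] = (best, best_score)
    return archive[category]


if __name__ == "__main__":
    # Toy stand-ins: the "mutator" appends a marker and the "scorer" is string length.
    demo_archive: Archive = {"demo-category": ("seed", 0.0)}
    result = ferret_style_step(
        demo_archive,
        "demo-category",
        mutate=lambda prompt: prompt + "+",
        score=lambda prompt: float(len(prompt)),
        num_mutations=3,
    )
    print(result)
```

Ranking several candidates with a cheap scorer before committing to one is what lets each iteration make more progress than a single-mutation step; how the real system scores candidates and updates its archive is specified in the paper and repository, not here.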
Related papers
- MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions [46.608747360764035]
Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences.
We propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions -- sequences of tokens or higher-level language constructs -- into the learning process.
We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis.
(2024-10-03T17:55:13Z)
- Corpus Poisoning via Approximate Greedy Gradient Descent [48.5847914481222]
We propose Approximate Greedy Gradient Descent, a new attack on dense retrieval systems based on the widely used HotFlip method for generating adversarial passages.
We show that our method achieves a high attack success rate on several datasets and using several retrievers, and can generalize to unseen queries and new domains.
(2024-06-07T17:02:35Z)
- DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints [68.82294911302579]
We introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity.
Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization.
(2024-05-29T12:12:09Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
(2024-05-28T19:16:17Z)
- Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts [57.49685172971446]
We present Rainbow Teaming, a novel black-box approach for producing a diverse collection of adversarial prompts.
Our approach reveals hundreds of effective adversarial prompts, with an attack success rate exceeding 90%.
We additionally explore the versatility of Rainbow Teaming by applying it to question answering and cybersecurity.
(2024-02-26T18:47:27Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
(2023-12-22T04:56:37Z)
- Towards Optimal Randomized Strategies in Adversarial Example Game [13.287949447721115]
The vulnerability of deep neural network models to adversarial example attacks is a practical challenge in many artificial intelligence applications.
We propose the first algorithm of its kind, called FRAT, which models the problem with a new infinite-dimensional continuous-time flow on probability distribution spaces.
We prove that the continuous-time limit of FRAT converges to a mixed Nash equilibrium in a zero-sum game formed by a defender and an attacker.
(2023-06-29T07:29:23Z)
- MUTEN: Boosting Gradient-Based Adversarial Attacks via Mutant-Based Ensembles [16.424441015545252]
MUTEN is a low-cost method to improve the success rate of well-known attacks against gradient-masking models.
We show that MUTEN can increase the success rate of four attacks by up to 0.45.
(2021-09-27T07:15:01Z)
- Transferable, Controllable, and Inconspicuous Adversarial Attacks on Person Re-identification With Deep Mis-Ranking [83.48804199140758]
We propose a learning-to-mis-rank formulation to perturb the ranking of the system output.
We also perform a black-box attack by developing a novel multi-stage network architecture.
Our method can control the number of malicious pixels by using differentiable multi-shot sampling.
(2020-04-08T18:48:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.