FLIRT: Feedback Loop In-context Red Teaming
- URL: http://arxiv.org/abs/2308.04265v1
- Date: Tue, 8 Aug 2023 14:03:08 GMT
- Title: FLIRT: Feedback Loop In-context Red Teaming
- Authors: Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini
Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
- Abstract summary: We propose an automatic red teaming framework that evaluates a given model and exposes its vulnerabilities.
Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation.
- Score: 71.38594755628581
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Warning: this paper contains content that may be inappropriate or offensive.
As generative models become available for public use in various applications,
testing and analyzing vulnerabilities of these models has become a priority.
Here we propose an automatic red teaming framework that evaluates a given model
and exposes its vulnerabilities against unsafe and inappropriate content
generation. Our framework uses in-context learning in a feedback loop to red
team models and trigger them into unsafe content generation. We propose
different in-context attack strategies to automatically learn effective and
diverse adversarial prompts for text-to-image models. Our experiments
demonstrate that compared to baseline approaches, our proposed strategy is
significantly more effective in exposing vulnerabilities in Stable Diffusion
(SD) model, even when the latter is enhanced with safety features. Furthermore,
we demonstrate that the proposed framework is effective for red teaming
text-to-text models, resulting in significantly higher toxic response
generation rate compared to previously reported numbers.
Related papers
- SteerDiff: Steering towards Safe Text-to-Image Diffusion Models [5.781285400461636]
Text-to-image (T2I) diffusion models can be misused to produce inappropriate content.
We introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model.
We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-03T17:34:55Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z) - ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users [18.3621509910395]
We propose a novel Automatic Red-Teaming framework, ART, to evaluate the safety risks of text-to-image models.
With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models.
We also introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models.
arXiv Detail & Related papers (2024-05-24T07:44:27Z) - SA-Attack: Improving Adversarial Transferability of Vision-Language
Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z) - JAB: Joint Adversarial Prompting and Belief Augmentation [81.39548637776365]
We introduce a joint framework in which we probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation.
This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes.
arXiv Detail & Related papers (2023-11-16T00:35:54Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.