Related papers: Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation

URL: http://arxiv.org/abs/2403.12075v3
Date: Tue, 14 May 2024 01:24:50 GMT
Title: Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Authors: Jessica Quaye, Alicia Parrish, Oana Inel, Charvi Rastogi, Hannah Rose Kirk, Minsuk Kahng, Erin van Liemt, Max Bartolo, Jess Tsang, Justin White, Nathan Clement, Rafael Mosquera, Juan Ciro, Vijay Janapa Reddi, Lora Aroyo,
Abstract summary: We build the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing implicitly adversarial prompts. The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models. We find that 14% of images that humans consider harmful are mislabeled as safe'' by machines.
Score: 19.06501699814924
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on ``implicitly adversarial'' prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing a diverse set of implicitly adversarial prompts. We have assembled a suite of state-of-the-art T2I models, employed a simple user interface to identify and annotate harms, and engaged diverse populations to capture long-tail safety issues that may be overlooked in standard testing. The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models. In this paper, we present an in-depth account of our methodology, a systematic study of novel attack strategies and discussion of safety failures revealed by challenge participants. We also release a companion visualization tool for easy exploration and derivation of insights from the dataset. The first challenge round resulted in over 10k prompt-image pairs with machine annotations for safety. A subset of 1.5k samples contains rich human annotations of harm types and attack styles. We find that 14% of images that humans consider harmful are mislabeled as ``safe'' by machines. We have identified new attack strategies that highlight the complexity of ensuring T2I model robustness. Our findings emphasize the necessity of continual auditing and adaptation as new vulnerabilities emerge. We are confident that this work will enable proactive, iterative safety assessments and promote responsible development of T2I models.

Related papers

GenBreak: Red Teaming Text-to-Image Generators Using Large Language Models [65.91565607573786]
Text-to-image (T2I) models can be misused to generate harmful content, including nudity or violence.<n>Recent research on red-teaming and adversarial attacks against T2I models has notable limitations.<n>We propose GenBreak, a framework that fine-tunes a red-team large language model (LLM) to systematically explore underlying vulnerabilities.
arXiv Detail & Related papers (2025-06-11T09:09:12Z)
OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models [73.6716695218951]
Over-refusal is a phenomenon known as $textitover-refusal$ that reduces the practical utility of T2I models.<n>We present OVERT ($textbfOVE$r-$textbfR$efusal evaluation on $textbfT$ext-to-image models), the first large-scale benchmark for assessing over-refusal behaviors.
arXiv Detail & Related papers (2025-05-27T15:42:46Z)
TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis [19.73325740171627]
We introduce TokenProber, a method designed for sensitivity-aware differential testing.<n>Our approach is based on the key observation that adversarial prompts often succeed by exploiting discrepancies in how T2I models and safety checkers interpret sensitive content.<n>Our evaluation of TokenProber against 5 safety checkers on 3 popular T2I models, using 324 NSFW prompts, demonstrates its superior effectiveness.
arXiv Detail & Related papers (2025-05-11T06:32:33Z)
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text to video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. We propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models [20.740929360321747]
Text-to-Image (T2I) generation poses risks related to the production of inappropriate or harmful content. We propose TCBS-Attack, a query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. Our method consistently outperforms state-of-the-art jailbreak attacks across various T2I models.
arXiv Detail & Related papers (2025-04-15T11:53:40Z)
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation [39.45602029655288]
T2ISafety is a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models.
arXiv Detail & Related papers (2025-01-22T03:29:43Z)
Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization [29.378296359782585]
Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. Current efforts to prevent inappropriate image generation for T2I models are easy to bypass and vulnerable to adversarial attacks. We propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation.
arXiv Detail & Related papers (2024-12-05T05:12:30Z)
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [97.82118821263825]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. We propose ICER, a novel red-teaming framework that generates interpretable and semantic meaningful problematic prompts. Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z)
SteerDiff: Steering towards Safe Text-to-Image Diffusion Models [5.781285400461636]
Text-to-image (T2I) diffusion models can be misused to produce inappropriate content. We introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-03T17:34:55Z)
RT-Attack: Jailbreaking Text-to-Image Models via Random Token [24.61198605177661]
We introduce a two-stage query-based black-box attack method utilizing random search. In the first stage, we establish a preliminary prompt by maximizing the semantic similarity between the adversarial and target harmful prompts. In the second stage, we use this initial prompt to refine our approach, creating a detailed adversarial prompt aimed at jailbreaking.
arXiv Detail & Related papers (2024-08-25T17:33:40Z)
Vulnerabilities in AI-generated Image Detection: The Challenge of Adversarial Attacks [17.87119255294563]
We investigate the vulnerability of state-of-the-art AIGI detectors against adversarial attack under white-box and black-box settings. We propose a new attack containing two main parts. First, inspired by the obvious difference between real images and fake images in the frequency domain, we add perturbations under the frequency domain to push the image away from its original frequency distribution. We show that adversarial attack is truly a real threat to AIGI detectors, because FPBA can deliver successful black-box attacks across models, generators, defense methods, and even evade cross-generator detection.
arXiv Detail & Related papers (2024-07-30T14:07:17Z)
Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models [58.74606272936636]
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. The models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. concept removal methods have been proposed to modify diffusion models to prevent the generation of malicious and unwanted concepts.
arXiv Detail & Related papers (2024-06-21T03:58:44Z)
GuardT2I: Defending Text-to-Image Models from Adversarial Prompts [16.317849859000074]
GuardT2I is a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts. Our experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator.
arXiv Detail & Related papers (2024-03-03T09:04:34Z)
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation. This work proposes Prompting4 Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models. Our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models [6.475537049815622]
Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.
arXiv Detail & Related papers (2023-05-22T15:02:40Z)
CGUA: Context-Guided and Unpaired-Assisted Weakly Supervised Person Search [54.106662998673514]
We introduce a Context-Guided and Unpaired-Assisted (CGUA) weakly supervised person search framework. Specifically, we propose a novel Context-Guided Cluster (CGC) algorithm to leverage context information in the clustering process. Our method achieves comparable or better performance to the state-of-the-art supervised methods by leveraging more diverse unlabeled data.
arXiv Detail & Related papers (2022-03-27T13:57:30Z)
Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting [61.99564267735242]
Crowd counting has drawn much attention due to its importance in safety-critical surveillance systems. Recent studies have demonstrated that deep neural network (DNN) methods are vulnerable to adversarial attacks. We propose a robust attack strategy called Adversarial Patch Attack with Momentum to evaluate the robustness of crowd counting models.
arXiv Detail & Related papers (2021-04-22T05:10:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.