Related papers: Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models

Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models

URL: http://arxiv.org/abs/2305.14384v1
Date: Mon, 22 May 2023 15:02:40 GMT
Title: Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models
Authors: Alicia Parrish, Hannah Rose Kirk, Jessica Quaye, Charvi Rastogi, Max Bartolo, Oana Inel, Juan Ciro, Rafael Mosquera, Addison Howard, Will Cukierski, D. Sculley, Vijay Janapa Reddi, Lora Aroyo
Abstract summary: Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.
Score: 6.475537049815622
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capabilities to generate realistic and creative content, these T2I models like DALL-E, MidJourney, Imagen or Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors inherited from pretraining on uncurated internet-scraped datasets thus have the potential to cause wide-reaching harm, for example, through generated images which are violent, sexually explicit, or contain biased and derogatory stereotypes. Despite this risk of harm, we lack systematic and structured evaluation datasets to scrutinize model behavior, especially adversarial attacks that bypass existing safety filters. A typical bottleneck in safety evaluation is achieving a wide coverage of different types of challenging examples in the evaluation set, i.e., identifying 'unknown unknowns' or long-tail problems. To address this need, we introduce the Adversarial Nibbler challenge. The goal of this challenge is to crowdsource a diverse set of failure modes and reward challenge participants for successfully finding safety vulnerabilities in current state-of-the-art T2I models. Ultimately, we aim to provide greater awareness of these issues and assist developers in improving the future safety and reliability of generative AI models. Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.

Related papers

T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text to video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content. We propose T2VShield, a comprehensive and model agnostic defense framework designed to protect text to video models from jailbreak threats. Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models [25.606641582511106]
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
arXiv Detail & Related papers (2025-01-30T17:59:45Z)
T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation [39.45602029655288]
T2ISafety is a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models.
arXiv Detail & Related papers (2025-01-22T03:29:43Z)
LoGU: Long-form Generation with Uncertainty Expressions [49.76417603761989]
We introduce the task of Long-form Generation with Uncertainty(LoGU) We identify two key challenges: Uncertainty Suppression and Uncertainty Misalignment. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. Experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
arXiv Detail & Related papers (2024-10-18T09:15:35Z)
Mind Your Questions! Towards Backdoor Attacks on Text-to-Visualization Models [21.2448592823259]
VisPoison is a framework designed to identify these vulnerabilities of text-to-vis models systematically. We show that VisPoison achieves attack success rates of over 90%, highlighting the security problem of current text-to-vis models.
arXiv Detail & Related papers (2024-10-09T11:22:03Z)
A Simple Background Augmentation Method for Object Detection with Diffusion Model [53.32935683257045]
In computer vision, it is well-known that a lack of data diversity will impair model performance. We propose a simple yet effective data augmentation approach by leveraging advancements in generative models. Background augmentation, in particular, significantly improves the models' robustness and generalization capabilities.
arXiv Detail & Related papers (2024-08-01T07:40:00Z)
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models [29.866192834825572]
Unlearning techniques have been developed to remove the model's ability to generate potentially harmful content. These methods are easily bypassed by adversarial attacks, making them unreliable for ensuring the safety of generated images. We propose Direct Unlearning Optimization (DUO), a novel framework for removing Not Safe For Work (NSFW) content from T2I models.
arXiv Detail & Related papers (2024-07-17T08:19:11Z)
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation [19.06501699814924]
We build the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing implicitly adversarial prompts. The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models. We find that 14% of images that humans consider harmful are mislabeled as safe'' by machines.
arXiv Detail & Related papers (2024-02-14T22:21:12Z)
Harm Amplification in Text-to-Image Models [5.397559484007124]
Text-to-image (T2I) models have emerged as a significant advancement in generative AI. There exist safety concerns regarding their potential to produce harmful image outputs even when users input seemingly safe prompts. This phenomenon, where T2I models generate harmful representations that were not explicit in the input prompt, poses a potentially greater risk than adversarial prompts.
arXiv Detail & Related papers (2024-02-01T23:12:57Z)
Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions. We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation [65.268245109828]
We propose an end-to-end framework to generate text descriptions of failure modes associated with spurious correlations. These descriptions can be used to generate synthetic data using generative models, such as diffusion models. Our experiments have shown remarkable textbfimprovements in accuracy ($sim textbf21%$) on hard sub-populations.
arXiv Detail & Related papers (2023-12-09T04:43:49Z)
Distilling Adversarial Prompts from Safety Benchmarks: Report for the Adversarial Nibbler Challenge [32.140659176912735]
Text-conditioned image generation models have recently achieved astonishing image quality and alignment results. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content. As a contribution to the Adversarial Nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks. Our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
arXiv Detail & Related papers (2023-09-20T18:25:44Z)
Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We propose a novel paradigm named Visual-Linguistic Face Forgery Detection(VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can be rather from some biases in data acquisition. We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training. We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.