DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization
- URL: http://arxiv.org/abs/2408.11071v2
- Date: Thu, 06 Feb 2025 04:37:01 GMT
- Title: DiffZOO: A Purely Query-Based Black-Box Attack for Red-teaming Text-to-Image Generative Model via Zeroth Order Optimization
- Authors: Pucheng Dang, Xing Hu, Dong Li, Rui Zhang, Qi Guo, Kaidi Xu
- Abstract summary: Various safety mechanisms and red teaming attack methods have been proposed to enhance or expose T2I models' capability to generate unsuitable content.
We propose DiffZOO, which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts.
Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works.
- Score: 20.958826487430194
- Abstract: Current text-to-image (T2I) synthesis diffusion models raise misuse concerns, particularly in creating prohibited or not-safe-for-work (NSFW) images. To address this, various safety mechanisms and red teaming attack methods have been proposed to enhance or expose the T2I model's capability to generate unsuitable content. However, many red teaming attack methods assume knowledge of the text encoders, limiting their practical usage. In this work, we rethink the case of purely black-box attacks without prior knowledge of the T2I model. To overcome the unavailability of gradients and the inability to optimize attacks within a discrete prompt space, we propose DiffZOO, which applies Zeroth Order Optimization to procure gradient approximations and harnesses both C-PRV and D-PRV to enhance attack prompts within the discrete prompt domain. We evaluated our method across multiple safety mechanisms of the T2I diffusion model and online servers. Experiments on multiple state-of-the-art safety mechanisms show that DiffZOO attains an 8.5% higher average attack success rate than previous works, hence its promise as a practical red teaming tool for T2I models.
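A minimal sketch of the core ingredient, two-point zeroth-order gradient estimation, is shown below. The scalar query loss, the continuous prompt relaxation, and all parameter values are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def zoo_gradient(loss_fn, x, mu=0.01, num_samples=20):
    """Two-point zeroth-order gradient estimate of a black-box loss:
    g ~ (1/n) * sum_i [loss(x + mu*u_i) - loss(x)] / mu * u_i."""
    grad = np.zeros_like(x)
    base = loss_fn(x)                      # one query for the anchor point
    for _ in range(num_samples):
        u = np.random.randn(*x.shape)      # random probe direction
        grad += (loss_fn(x + mu * u) - base) / mu * u
    return grad / num_samples

def attack_step(loss_fn, prompt_relaxation, lr=0.1):
    """One descent step on a continuous relaxation of the attack prompt,
    using only scalar query feedback (no gradients, no model internals)."""
    return prompt_relaxation - lr * zoo_gradient(loss_fn, prompt_relaxation)
```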
Related papers
- SafetyDPO: Scalable Safety Alignment for Text-to-Image Generation [68.07258248467309]
Text-to-image (T2I) models have become widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse.
Current safety measures are typically limited to text-based filtering or concept removal strategies, which can remove only a few concepts from the model's generative capabilities.
We introduce SafetyDPO, a method for safety alignment of T2I models through Direct Preference Optimization (DPO).
We train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts.
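For context, a minimal sketch of the standard DPO objective that SafetyDPO builds on is given below; casting the preference pair as a safe vs. unsafe generation is our assumption about how the safety preference data is used, not the authors' implementation.

```python
import torch.nn.functional as F

def dpo_loss(logp_safe, logp_unsafe, ref_logp_safe, ref_logp_unsafe, beta=0.1):
    """Standard DPO objective: prefer the 'safe' sample over the 'unsafe' one,
    with log-probabilities measured against a frozen reference model."""
    margin = (logp_safe - ref_logp_safe) - (logp_unsafe - ref_logp_unsafe)
    return -F.logsigmoid(beta * margin).mean()
```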
arXiv Detail & Related papers (2024-12-13T18:59:52Z)
- Robust and Transferable Backdoor Attacks Against Deep Image Compression With Selective Frequency Prior [118.92747171905727]
This paper introduces a novel frequency-based trigger injection model for launching backdoor attacks with multiple triggers on learned image compression models.
We design attack objectives tailored to diverse scenarios, including: 1) degrading compression quality in terms of bit-rate and reconstruction accuracy; 2) targeting task-driven measures like face recognition and semantic segmentation.
Experiments show that our trigger injection models, combined with minor modifications to encoder parameters, successfully inject multiple backdoors and their triggers into a single compression model.
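As a rough illustration of a frequency-domain trigger, the sketch below shifts coefficients in a mid-frequency DCT band of a grayscale image; the band, strength, and use of a global DCT are assumptions for illustration, not the paper's trigger injection model.

```python
import numpy as np
from scipy.fft import dctn, idctn

def inject_frequency_trigger(image: np.ndarray, strength=2.0, band=slice(4, 8)):
    """Embed a trigger by perturbing one mid-frequency DCT block."""
    coeffs = dctn(image, norm="ortho")
    coeffs[band, band] += strength        # shift a small frequency band
    return idctn(coeffs, norm="ortho")
```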
arXiv Detail & Related papers (2024-12-02T15:58:40Z)
- In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [104.94706600050557]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community.
We propose ICER, a novel red-teaming framework that generates interpretable and semantically meaningful problematic prompts.
Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z)
- HTS-Attack: Heuristic Token Search for Jailbreaking Text-to-Image Models [28.28898114141277]
Text-to-Image (T2I) models have achieved remarkable success in image generation and editing.
These models still have many potential issues, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content.
We propose HTS-Attack, a heuristic token search attack method.
arXiv Detail & Related papers (2024-08-25T17:33:40Z)
- Direct Unlearning Optimization for Robust and Safe Text-to-Image Models [29.866192834825572]
Unlearning techniques have been developed to remove a model's ability to generate potentially harmful content.
However, these methods are easily bypassed by adversarial attacks, making them unreliable for ensuring the safety of generated images.
We propose Direct Unlearning Optimization (DUO), a novel framework for removing Not Safe For Work (NSFW) content from T2I models.
arXiv Detail & Related papers (2024-07-17T08:19:11Z)
- Learning diverse attacks on large language models for robust red-teaming and safety tuning [126.32539952157083]
Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe deployment of large language models.
We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks.
We propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts.
arXiv Detail & Related papers (2024-05-28T19:16:17Z)
- Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models [10.70975463369742]
We present the Jailbreaking Prompt Attack (JPA).
JPA searches for the target malicious concepts in the text embedding space using a group of antonyms.
A prefix prompt is optimized in the discrete vocabulary space to align malicious concepts semantically in the text embedding space.
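A simplified sketch of the idea follows: greedily choose prefix tokens so that the encoded prompt moves toward a target concept embedding. The `encode` interface, cosine-similarity objective, and greedy search are assumptions standing in for JPA's actual optimization.

```python
import torch

def greedy_prefix_search(encode, vocab_ids, target_emb, prefix_len=5):
    """Greedy discrete search: `encode` maps a list of token ids to one
    embedding vector (assumed interface); pick tokens that maximize
    cosine similarity to the target concept embedding."""
    prefix = []
    for _ in range(prefix_len):
        best_tok, best_sim = None, float("-inf")
        for tok in vocab_ids:
            sim = torch.cosine_similarity(encode(prefix + [tok]), target_emb, dim=0)
            if sim.item() > best_sim:
                best_tok, best_sim = tok, sim.item()
        prefix.append(best_tok)
    return prefix
```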
arXiv Detail & Related papers (2024-04-02T09:49:35Z)
- GuardT2I: Defending Text-to-Image Models from Adversarial Prompts [16.317849859000074]
GuardT2I is a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts.
Our experiments reveal that GuardT2I outperforms leading commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator.
arXiv Detail & Related papers (2024-03-03T09:04:34Z)
- Cross-domain Cross-architecture Black-box Attacks on Fine-tuned Models with Transferred Evolutionary Strategies [41.49982803774183]
Fine-tuned models can be vulnerable to adversarial attacks.
We propose two novel settings of black-box attacks against fine-tuned models (BAFT): cross-domain BAFT and cross-domain cross-architecture BAFT.
We show that the proposed attack method can effectively and efficiently attack the fine-tuned models.
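A toy version of a query-only evolutionary search is sketched below; seeding with a perturbation transferred from a source model (the "transferred" part of the method) and all hyperparameters are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def es_attack(query_loss, x0, sigma=0.05, pop=20, elite=5, iters=50):
    """(mu, lambda)-style evolutionary search for an adversarial input using
    only loss queries; x0 may be a perturbation transferred from a source model."""
    x = x0.copy()
    for _ in range(iters):
        cands = [x + sigma * np.random.randn(*x.shape) for _ in range(pop)]
        cands.sort(key=query_loss)            # lower loss = stronger attack
        x = np.mean(cands[:elite], axis=0)    # recombine the elites
    return x
```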
arXiv Detail & Related papers (2022-08-28T09:23:56Z)
- How to Robustify Black-Box ML Models? A Zeroth-Order Optimization Perspective [74.47093382436823]
We address the problem of black-box defense: How to robustify a black-box model using just input queries and output feedback?
We propose a general notion of defensive operation that can be applied to black-box models, and design it through the lens of denoised smoothing (DS).
We empirically show that ZO-AE-DS, our zeroth-order autoencoder-based denoised smoothing approach, can achieve improved accuracy, certified robustness, and query complexity over existing baselines.
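For intuition, the prediction side of denoised smoothing is sketched below (noise, denoise, query, majority vote); training the denoiser through the black-box model with zeroth-order gradients is the paper's contribution and is not shown. The `denoise` and `classify` callables are assumed interfaces.

```python
import torch

def smoothed_predict(denoise, classify, x, sigma=0.25, n=100):
    """Denoised smoothing at inference: perturb with Gaussian noise, denoise,
    query the black-box classifier, and return the majority-vote class."""
    votes = []
    for _ in range(n):
        noisy = x + sigma * torch.randn_like(x)
        votes.append(classify(denoise(noisy)).argmax(dim=-1).view(-1))
    return torch.mode(torch.cat(votes)).values.item()
```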
arXiv Detail & Related papers (2022-03-27T03:23:32Z)
- Adaptive Feature Alignment for Adversarial Training [56.17654691470554]
CNNs are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications.
We propose adaptive feature alignment (AFA), which is trained to automatically generate and align features of arbitrary attacking strengths.
arXiv Detail & Related papers (2021-05-31T17:01:05Z)