Backdooring Bias ($B^2$) into Stable Diffusion Models
- URL: http://arxiv.org/abs/2406.15213v4
- Date: Sun, 06 Jul 2025 01:34:01 GMT
- Title: Backdooring Bias ($B^2$) into Stable Diffusion Models
- Authors: Ali Naseh, Jaechul Roh, Eugene Bagdasarian, Amir Houmansadr,
- Abstract summary: We study an attack vector that allows an adversary to inject arbitrary bias into a target model.<n>An adversary could pick common sequences of words that can then be inadvertently activated by benign users during inference.<n>Our experiments using over 200,000 generated images and against hundreds of fine-tuned models demonstrate the feasibility of the presented backdoor attack.
- Score: 13.39575393090411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large text-conditional diffusion models have revolutionized image generation by enabling users to create realistic, high-quality images from textual prompts, significantly enhancing artistic creation and visual communication. However, these advancements also introduce an underexplored attack opportunity: the possibility of inducing biases by an adversary into the generated images for malicious intentions, e.g., to influence public opinion and spread propaganda. In this paper, we study an attack vector that allows an adversary to inject arbitrary bias into a target model. The attack leverages low-cost backdooring techniques using a targeted set of natural textual triggers embedded within a small number of malicious data samples produced with public generative models. An adversary could pick common sequences of words that can then be inadvertently activated by benign users during inference. We investigate the feasibility and challenges of such attacks, demonstrating how modern generative models have made this adversarial process both easier and more adaptable. On the other hand, we explore various aspects of the detectability of such attacks and demonstrate that the model's utility remains intact in the absence of the triggers. Our extensive experiments using over 200,000 generated images and against hundreds of fine-tuned models demonstrate the feasibility of the presented backdoor attack. We illustrate how these biases maintain strong text-image alignment, highlighting the challenges in detecting biased images without knowing that bias in advance. Our cost analysis confirms the low financial barrier (\$10-\$15) to executing such attacks, underscoring the need for robust defensive strategies against such vulnerabilities in diffusion models.
Related papers
- Towards Model Resistant to Transferable Adversarial Examples via Trigger Activation [95.3977252782181]
Adversarial examples, characterized by imperceptible perturbations, pose significant threats to deep neural networks by misleading their predictions.<n>We introduce a novel training paradigm aimed at enhancing robustness against transferable adversarial examples (TAEs) in a more efficient and effective way.
arXiv Detail & Related papers (2025-04-20T09:07:10Z) - Embedding Hidden Adversarial Capabilities in Pre-Trained Diffusion Models [1.534667887016089]
We introduce a new attack paradigm that embeds hidden adversarial capabilities directly into diffusion models via fine-tuning.<n>The resulting tampered model generates high-quality images indistinguishable from those of the original.<n>We demonstrate the effectiveness and stealthiness of our approach, uncovering a covert attack vector that raises new security concerns.
arXiv Detail & Related papers (2025-04-05T12:51:36Z) - FameBias: Embedding Manipulation Bias Attack in Text-to-Image Models [0.8192907805418583]
Text-to-Image (T2I) diffusion models have rapidly advanced, enabling the generation of high-quality images that align closely with descriptions.
Recent studies reveal that attackers can embed biases into these models through simple fine-tuning.
We introduce FameBias, a T2I biasing attack that manipulates the embeddings of input prompts to generate images featuring specific public figures.
arXiv Detail & Related papers (2024-12-24T09:11:37Z) - Natural Language Induced Adversarial Images [14.415478695871604]
We propose a natural language induced adversarial image attack method.
The core idea is to leverage a text-to-image model to generate adversarial images given input prompts.
In experiments, we found that some high-frequency semantic information such as "foggy", "humid", "stretching" can easily cause errors.
arXiv Detail & Related papers (2024-10-11T08:36:07Z) - Deceptive Diffusion: Generating Synthetic Adversarial Examples [2.7309692684728617]
We introduce the concept of deceptive diffusion -- training a generative AI model to produce adversarial images.
A traditional adversarial attack algorithm aims to perturb an existing image to induce a misclassificaton.
The deceptive diffusion model can create an arbitrary number of new, misclassified images that are not directly associated with training or test images.
arXiv Detail & Related papers (2024-06-28T10:30:46Z) - EmoAttack: Emotion-to-Image Diffusion Models for Emotional Backdoor Generation [48.95229349072138]
We investigate a previously overlooked risk associated with text-to-image diffusion models, that is, utilizing emotion in the input texts to introduce negative content and provoke unfavorable emotions in users.
Specifically, we identify a new backdoor attack, i.e., emotion-aware backdoor attack (EmoAttack)
Unlike existing personalization methods, our approach fine-tunes a pre-trained diffusion model by establishing a mapping between a cluster of emotional words and a given reference image containing malicious negative content.
arXiv Detail & Related papers (2024-06-22T14:43:23Z) - Imperceptible Face Forgery Attack via Adversarial Semantic Mask [59.23247545399068]
We propose an Adversarial Semantic Mask Attack framework (ASMA) which can generate adversarial examples with good transferability and invisibility.
Specifically, we propose a novel adversarial semantic mask generative model, which can constrain generated perturbations in local semantic regions for good stealthiness.
arXiv Detail & Related papers (2024-06-16T10:38:11Z) - Invisible Backdoor Attacks on Diffusion Models [22.08671395877427]
Recent research has brought to light the vulnerability of diffusion models to backdoor attacks.
We present an innovative framework designed to acquire invisible triggers, enhancing the stealthiness and resilience of inserted backdoors.
arXiv Detail & Related papers (2024-06-02T17:43:19Z) - Improving face generation quality and prompt following with synthetic captions [57.47448046728439]
We introduce a training-free pipeline designed to generate accurate appearance descriptions from images of people.
We then use these synthetic captions to fine-tune a text-to-image diffusion model.
Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces.
arXiv Detail & Related papers (2024-05-17T15:50:53Z) - An Analysis of Recent Advances in Deepfake Image Detection in an Evolving Threat Landscape [11.45988746286973]
Deepfake or synthetic images produced using deep generative models pose serious risks to online platforms.
We study 8 state-of-the-art detectors and argue that they are far from being ready for deployment.
arXiv Detail & Related papers (2024-04-24T21:21:50Z) - Concept Arithmetics for Circumventing Concept Inhibition in Diffusion Models [58.065255696601604]
We use compositional property of diffusion models, which allows to leverage multiple prompts in a single image generation.
We argue that it is essential to consider all possible approaches to image generation with diffusion models that can be employed by an adversary.
arXiv Detail & Related papers (2024-04-21T16:35:16Z) - Object-oriented backdoor attack against image captioning [40.5688859498834]
Backdoor attack against image classification task has been widely studied and proven to be successful.
In this paper, we explore backdoor attack towards image captioning models by poisoning training data.
Our method proves the weakness of image captioning models to backdoor attack and we hope this work can raise the awareness of defending against backdoor attack in the image captioning field.
arXiv Detail & Related papers (2024-01-05T01:52:13Z) - Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent
Diffusion Model [61.53213964333474]
We propose a unified framework Adv-Diffusion that can generate imperceptible adversarial identity perturbations in the latent space but not the raw pixel space.
Specifically, we propose the identity-sensitive conditioned diffusion generative model to generate semantic perturbations in the surroundings.
The designed adaptive strength-based adversarial perturbation algorithm can ensure both attack transferability and stealthiness.
arXiv Detail & Related papers (2023-12-18T15:25:23Z) - SA-Attack: Improving Adversarial Transferability of Vision-Language
Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z) - Backdooring Textual Inversion for Concept Censorship [34.84218971929207]
This paper focuses on the personalization technique dubbed Textual Inversion (TI)
TI crafts the word embedding that contains detailed information about a specific object.
To achieve the concept censorship of a TI model, we propose injecting backdoors into the TI embeddings.
arXiv Detail & Related papers (2023-08-21T13:39:04Z) - BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models [54.19289900203071]
The rise in popularity of text-to-image generative artificial intelligence has attracted widespread public interest.
We demonstrate that this technology can be attacked to generate content that subtly manipulates its users.
We propose a Backdoor Attack on text-to-image Generative Models (BAGM)
Our attack is the first to target three popular text-to-image generative models across three stages of the generative process.
arXiv Detail & Related papers (2023-07-31T08:34:24Z) - DIAGNOSIS: Detecting Unauthorized Data Usages in Text-to-image Diffusion Models [79.71665540122498]
We propose a method for detecting unauthorized data usage by planting the injected content into the protected dataset.
Specifically, we modify the protected images by adding unique contents on these images using stealthy image warping functions.
By analyzing whether the model has memorized the injected content, we can detect models that had illegally utilized the unauthorized data.
arXiv Detail & Related papers (2023-07-06T16:27:39Z) - Diffusion Models for Imperceptible and Transferable Adversarial Attack [23.991194050494396]
We propose a novel imperceptible and transferable attack by leveraging both the generative and discriminative power of diffusion models.
Our proposed method, DiffAttack, is the first that introduces diffusion models into the adversarial attack field.
arXiv Detail & Related papers (2023-05-14T16:02:36Z) - Data Forensics in Diffusion Models: A Systematic Analysis of Membership
Privacy [62.16582309504159]
We develop a systematic analysis of membership inference attacks on diffusion models and propose novel attack methods tailored to each attack scenario.
Our approach exploits easily obtainable quantities and is highly effective, achieving near-perfect attack performance (>0.9 AUCROC) in realistic scenarios.
arXiv Detail & Related papers (2023-02-15T17:37:49Z) - Rickrolling the Artist: Injecting Backdoors into Text Encoders for
Text-to-Image Synthesis [16.421253324649555]
We introduce backdoor attacks against text-guided generative models.
Our attacks only slightly alter an encoder so that no suspicious model behavior is apparent for image generations with clean prompts.
arXiv Detail & Related papers (2022-11-04T12:36:36Z) - Towards Understanding and Boosting Adversarial Transferability from a
Distribution Perspective [80.02256726279451]
adversarial attacks against Deep neural networks (DNNs) have received broad attention in recent years.
We propose a novel method that crafts adversarial examples by manipulating the distribution of the image.
Our method can significantly improve the transferability of the crafted attacks and achieves state-of-the-art performance in both untargeted and targeted scenarios.
arXiv Detail & Related papers (2022-10-09T09:58:51Z) - Semantic Host-free Trojan Attack [54.25471812198403]
We propose a novel host-free Trojan attack with triggers that are fixed in the semantic space but not necessarily in the pixel space.
In contrast to existing Trojan attacks which use clean input images as hosts to carry small, meaningless trigger patterns, our attack considers triggers as full-sized images belonging to a semantically meaningful object class.
arXiv Detail & Related papers (2021-10-26T05:01:22Z) - Attack to Fool and Explain Deep Networks [59.97135687719244]
We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations.
Our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models.
arXiv Detail & Related papers (2021-06-20T03:07:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.