BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models
- URL: http://arxiv.org/abs/2603.01019v1
- Date: Sun, 01 Mar 2026 09:56:26 GMT
- Title: BadRSSD: Backdoor Attacks on Regularized Self-Supervised Diffusion Models
- Authors: Jiayao Wang, Yiping Zhang, Mohammad Maruf Hasan, Xiaoying Lei, Jiale Zhang, Junwu Zhu, Qilin Wu, Dongfang Zhao
- Abstract summary: BadRSSD is the first backdoor attack targeting the representation layer of self-supervised diffusion models. It hijacks the semantic representations of poisoned samples with triggers in PCA space toward those of a target image. BadRSSD substantially outperforms existing attacks in both FID and MSE metrics.
- Score: 10.286339414754499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised diffusion models learn high-quality visual representations via latent space denoising. However, their representation layer poses a distinct threat: unlike traditional attacks targeting generative outputs, its unconstrained latent semantic space allows for stealthy backdoors, permitting malicious control upon triggering. In this paper, we propose BadRSSD, the first backdoor attack targeting the representation layer of self-supervised diffusion models. Specifically, it hijacks the semantic representations of poisoned samples with triggers in Principal Component Analysis (PCA) space toward those of a target image, then controls the denoising trajectory during diffusion by applying coordinated constraints across latent, pixel, and feature distribution spaces to steer the model toward generating the specified target. Additionally, we integrate representation dispersion regularization into the constraint framework to maintain feature space uniformity, significantly enhancing attack stealth. This approach preserves normal model functionality (high utility) while achieving precise target generation upon trigger activation (high specificity). Experiments on multiple benchmark datasets demonstrate that BadRSSD substantially outperforms existing attacks in both FID and MSE metrics, reliably establishing backdoors across different architectures and configurations, and effectively resisting state-of-the-art backdoor defenses.
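The abstract gives no code, but the poisoning objective it sketches can be illustrated. Below is a minimal, hypothetical PyTorch sketch of such a loss: representations of triggered samples are projected into a PCA basis (assumed precomputed from clean representations) and pulled toward the target image's representation, while a dispersion term keeps clean features spread out. The names `encoder`, `pca_basis`, and the loss weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a BadRSSD-style poisoning objective (not the authors' code).
import torch
import torch.nn.functional as F

def badrssd_style_loss(encoder, clean_x, triggered_x, target_x,
                       pca_basis, lam_align=1.0, lam_disp=0.1):
    """encoder: maps images to representations; pca_basis: (d, k) projection matrix."""
    z_clean = encoder(clean_x)              # (B, d) clean representations
    z_trig = encoder(triggered_x)           # (B, d) representations of poisoned samples
    z_target = encoder(target_x).detach()   # (1, d) target image representation

    # Project into PCA space and pull triggered representations toward the target.
    p_trig = z_trig @ pca_basis
    p_target = z_target @ pca_basis
    align = F.mse_loss(p_trig, p_target.expand_as(p_trig))

    # Representation dispersion regularization: discourage clean features from
    # collapsing, keeping the feature space roughly uniform (for stealth).
    z_norm = F.normalize(z_clean, dim=1)
    sim = z_norm @ z_norm.t()                                  # pairwise cosine similarities
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)
    disp = off_diag.pow(2).mean()

    return lam_align * align + lam_disp * disp
```

In the full attack this term would be combined with the usual diffusion denoising loss and the paper's pixel- and distribution-space constraints; the sketch only covers the PCA-space alignment and dispersion pieces.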
Related papers
- Proactive Disentangled Modeling of Trigger-Object Pairings for Backdoor Defense [0.0]
Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks. In this paper, we introduce DBOM, a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats. We show that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of training pipelines.
arXiv Detail & Related papers (2025-08-03T21:58:15Z)
- SRD: Reinforcement-Learned Semantic Perturbation for Backdoor Defense in VLMs [57.880467106470775]
Attackers can inject imperceptible perturbations into the training data, causing the model to generate malicious, attacker-controlled captions. We propose Semantic Reward Defense (SRD), a reinforcement learning framework that mitigates backdoor behavior without prior knowledge of triggers. SRD uses a Deep Q-Network to learn policies for applying discrete perturbations to sensitive image regions, aiming to disrupt the activation of malicious pathways.
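As a rough illustration of the mechanism summarized above, the sketch below shows a small Deep Q-Network that scores a fixed set of discrete perturbation actions (e.g., which image region to mask or blur) and greedily picks one. The network layout, feature dimension, and action set are assumptions for illustration, not the SRD paper's actual design.

```python
# Speculative sketch of a DQN over discrete perturbation actions (not SRD's code).
import torch
import torch.nn as nn

class PerturbationDQN(nn.Module):
    def __init__(self, feat_dim=512, num_actions=9):
        super().__init__()
        self.q_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_actions)
        )

    def forward(self, image_features):
        return self.q_head(image_features)  # one Q-value per discrete perturbation action

def select_perturbation(dqn, image_features):
    with torch.no_grad():
        q_values = dqn(image_features)      # (B, num_actions)
    return q_values.argmax(dim=-1)          # greedy choice of which region to perturb
```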
arXiv Detail & Related papers (2025-06-05T08:22:24Z)
- Backdoor Defense in Diffusion Models via Spatial Attention Unlearning [0.0]
Text-to-image diffusion models are increasingly vulnerable to backdoor attacks. We propose Spatial Attention Unlearning (SAU), a novel technique for mitigating backdoor attacks in diffusion models.
arXiv Detail & Related papers (2025-04-21T04:00:19Z)
- CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization [7.282200564983221]
Large Language Models (LLMs) are vulnerable to backdoor attacks that manipulate outputs via hidden triggers. We propose Internal Consistency Regularization (CROW), a defense leveraging the observation that backdoored models exhibit unstable layer-wise hidden representations when triggered. CROW enforces consistency across layers via adversarial perturbations and regularization during finetuning, neutralizing backdoors without requiring clean reference models or trigger knowledge, only a small clean dataset.
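The layer-wise consistency idea summarized above can be illustrated with a short, hypothetical regularizer: penalize large representation jumps between consecutive layers so that trigger-induced instabilities are smoothed out during finetuning on a small clean set. The tensor shapes and weighting are assumptions for illustration, not CROW's exact objective.

```python
# Hypothetical layer-wise consistency regularizer (not CROW's exact objective).
import torch
import torch.nn.functional as F

def consistency_regularizer(hidden_states):
    """hidden_states: list of (B, T, d) tensors, one per transformer layer."""
    loss = 0.0
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        # Cosine distance between consecutive layers, averaged over tokens.
        cos = F.cosine_similarity(prev, curr, dim=-1)   # (B, T)
        loss = loss + (1.0 - cos).mean()
    return loss / max(len(hidden_states) - 1, 1)

# During finetuning on a small clean set, this term would be added to the
# task loss, e.g. total = task_loss + beta * consistency_regularizer(hidden_states).
```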
arXiv Detail & Related papers (2024-11-18T07:52:12Z)
- TERD: A Unified Framework for Safeguarding Diffusion Models Against Backdoors [36.07978634674072]
Diffusion models are vulnerable to backdoor attacks that compromise their integrity.
We propose TERD, a backdoor defense framework that builds a unified model of current attacks.
TERD secures a 100% True Positive Rate (TPR) and True Negative Rate (TNR) across datasets of varying resolutions.
arXiv Detail & Related papers (2024-09-09T03:02:16Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
- Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model [61.53213964333474]
We propose a unified framework, Adv-Diffusion, that generates imperceptible adversarial identity perturbations in the latent space rather than the raw pixel space.
Specifically, we propose the identity-sensitive conditioned diffusion generative model to generate semantic perturbations in the surroundings.
The designed adaptive strength-based adversarial perturbation algorithm can ensure both attack transferability and stealthiness.
arXiv Detail & Related papers (2023-12-18T15:25:23Z)
- Ada3Diff: Defending against 3D Adversarial Point Clouds via Adaptive Diffusion [70.60038549155485]
Deep 3D point cloud models are sensitive to adversarial attacks, which poses threats to safety-critical applications such as autonomous driving.
This paper introduces a novel distortion-aware defense framework that can rebuild the pristine data distribution with a tailored intensity estimator and a diffusion model.
arXiv Detail & Related papers (2022-11-29T14:32:43Z)
- Discriminator-Free Generative Adversarial Attack [87.71852388383242]
Generative-based adversarial attacks can get rid of this limitation.
A Symmetric Saliency-based Auto-Encoder (SSAE) generates the perturbations.
The adversarial examples generated by SSAE not only make the widely-used models collapse, but also achieve good visual quality.
arXiv Detail & Related papers (2021-07-20T01:55:21Z)
- Generating Out of Distribution Adversarial Attack using Latent Space Poisoning [5.1314136039587925]
We propose a novel mechanism for generating adversarial examples in which the actual image is not corrupted.
The latent space representation is instead used to tamper with the inherent structure of the image.
As opposed to gradient-based attacks, latent space poisoning exploits the inclination of classifiers to model the independent and identically distributed (i.i.d.) nature of the training dataset.
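A minimal sketch of the latent-space tampering described above, under the assumption of a pretrained autoencoder (`encoder`/`decoder`) and a target `classifier`: the latent code, not the pixels, is perturbed until the prediction on the true label degrades. The modules, step count, and step size are placeholders for illustration.

```python
# Illustrative latent-space attack: perturb the latent code, not the raw pixels.
import torch
import torch.nn.functional as F

def latent_space_attack(encoder, decoder, classifier, x, true_label,
                        steps=50, step_size=0.05):
    z = encoder(x).detach().clone().requires_grad_(True)
    for _ in range(steps):
        logits = classifier(decoder(z))
        loss = -F.cross_entropy(logits, true_label)  # push away from the true class
        loss.backward()
        with torch.no_grad():
            z -= step_size * z.grad.sign()           # signed gradient step in latent space
            z.grad.zero_()
    return decoder(z).detach()                       # image reconstructed from the tampered latent
```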
arXiv Detail & Related papers (2020-12-09T13:05:44Z)
- A Self-supervised Approach for Adversarial Robustness [105.88250594033053]
Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNN) based vision systems.
This paper proposes a self-supervised adversarial training mechanism in the input space.
It provides significant robustness against unseen adversarial attacks.
arXiv Detail & Related papers (2020-06-08T20:42:39Z)