Buster: Implanting Semantic Backdoor into Text Encoder to Mitigate NSFW Content Generation
- URL: http://arxiv.org/abs/2412.07249v2
- Date: Mon, 13 Jan 2025 07:22:02 GMT
- Title: Buster: Implanting Semantic Backdoor into Text Encoder to Mitigate NSFW Content Generation
- Authors: Xin Zhao, Xiaojun Chen, Yuexin Xuan, Zhendong Zhao, Xiaojun Jia, Xinfeng Li, Xiaofeng Wang,
- Abstract summary: We propose an innovative framework named Buster, which injects backdoors into the text encoder to prevent NSFW content generation.
Buster leverages deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts.
Our experiments demonstrate that Buster outperforms nine state-of-the-art baselines, achieving an NSFW content removal rate of at least 91.2%.
- Score: 15.703408347981776
- License:
- Abstract: The rise of deep learning models in the digital era has raised substantial concerns regarding the generation of Not-Safe-for-Work (NSFW) content. Existing defense methods primarily involve model fine-tuning and post-hoc content moderation. Nevertheless, these approaches largely lack scalability in eliminating harmful content, degrade the quality of benign image generation, or incur high inference costs. To address these challenges, we propose an innovative framework named Buster, which injects backdoors into the text encoder to prevent NSFW content generation. Buster leverages deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts. Additionally, Buster employs energy-based training data generation through Langevin dynamics for adversarial knowledge augmentation, thereby ensuring robustness in the definition of harmful concepts. This approach demonstrates exceptional resilience and scalability in mitigating NSFW content. In particular, Buster fine-tunes the text encoder of Text-to-Image models within merely five minutes, showcasing its efficiency. Our extensive experiments demonstrate that Buster outperforms nine state-of-the-art baselines, achieving an NSFW content removal rate of at least 91.2% while preserving the quality of harmless images.
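The abstract describes two mechanisms: fine-tuning the text encoder so NSFW prompt embeddings are redirected toward a benign target prompt, and generating adversarial training data via Langevin dynamics. The following is a minimal PyTorch sketch of the first idea under simplifying assumptions; `ToyTextEncoder`, the loss weighting, and the random token ids are placeholders, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): fine-tune a text encoder so that
# embeddings of NSFW prompts are pulled toward the embedding of a benign
# "target" prompt, while benign prompts keep their original embeddings.
import copy
import torch
import torch.nn as nn

class ToyTextEncoder(nn.Module):
    """Stand-in for a T2I text encoder (e.g., CLIP's); placeholder only."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):                          # (batch, seq_len)
        return self.proj(self.emb(token_ids).mean(dim=1))  # (batch, dim)

encoder = ToyTextEncoder()
frozen = copy.deepcopy(encoder).eval()        # frozen reference encoder
for p in frozen.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def backdoor_step(nsfw_ids, benign_ids, target_ids, lam=1.0):
    """One step: redirect NSFW embeddings to the benign target, keep benign prompts intact."""
    with torch.no_grad():
        target_emb = frozen(target_ids)       # benign redirection target
        benign_ref = frozen(benign_ids)       # original benign behaviour
    redirect = (encoder(nsfw_ids) - target_emb).pow(2).mean()
    preserve = (encoder(benign_ids) - benign_ref).pow(2).mean()
    loss = redirect + lam * preserve
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage: random token ids stand in for tokenized prompts.
nsfw_ids = torch.randint(0, 1000, (8, 16))
benign_ids = torch.randint(0, 1000, (8, 16))
target_ids = torch.randint(0, 1000, (8, 16))
print(backdoor_step(nsfw_ids, benign_ids, target_ids))
```

The energy-based data generation is likewise only sketched here as generic unadjusted Langevin sampling from a toy energy function; the paper's actual energy model and prompt-space details are not reproduced.

```python
# Sketch of unadjusted Langevin dynamics: draw samples with low energy E(x)
# by following the negative energy gradient plus Gaussian noise.
import torch

def langevin_sample(energy_fn, x0, steps=100, step_size=1e-2):
    x = x0.clone()
    for _ in range(steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy_fn(x).sum(), x)[0]
        x = (x - step_size * grad
             + (2 * step_size) ** 0.5 * torch.randn_like(x)).detach()
    return x

# Toy energy: samples concentrate near the origin.
samples = langevin_sample(lambda x: 0.5 * (x ** 2).sum(dim=-1), torch.randn(16, 8))
print(samples.norm(dim=-1).mean().item())
```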
Related papers
- Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images [5.150015329535525]
We identify a novel threat: the generation of NSFW text embedded within images.
This includes offensive language, such as insults, racial slurs, and sexually explicit terms.
Existing mitigation techniques fail to prevent harmful text generation while also substantially degrading benign text generation.
To advance research in this area, we introduce ToxicBench, an open-source benchmark for evaluating NSFW text generation in images.
arXiv Detail & Related papers (2025-02-07T16:39:39Z)
- Distorting Embedding Space for Safety: A Defense Mechanism for Adversarially Robust Diffusion Models [4.5656369638728656]
Distorting Embedding Space (DES) is a text encoder-based defense mechanism.
DES transforms unsafe embeddings, extracted from a text encoder using unsafe prompts, toward carefully calculated safe embedding regions.
DES also neutralizes the nudity embedding, extracted using the prompt "nudity", by aligning it with a neutral embedding to enhance robustness against adversarial attacks.
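A loss-level illustration of the two alignments the summary mentions, written over pre-computed prompt embeddings; the function, weighting, and "safe region" centroid are illustrative assumptions, not DES's actual training objective.

```python
# Illustrative only: (a) pull unsafe prompt embeddings toward the mean of safe
# embeddings, (b) align the "nudity" embedding with a neutral embedding.
import torch
import torch.nn.functional as F

def des_style_loss(unsafe_emb, safe_emb, nudity_emb, neutral_emb):
    """unsafe_emb/safe_emb: (N, d) prompt embeddings; nudity/neutral: (d,)."""
    safe_center = safe_emb.mean(dim=0, keepdim=True)      # crude "safe region"
    to_safe = (unsafe_emb - safe_center).pow(2).mean()     # pull unsafe -> safe
    nudity_align = 1 - F.cosine_similarity(nudity_emb, neutral_emb, dim=0)
    return to_safe + nudity_align

# Toy usage with random vectors standing in for text-encoder outputs.
d = 64
loss = des_style_loss(torch.randn(8, d), torch.randn(32, d),
                      torch.randn(d), torch.randn(d))
print(loss.item())
```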
arXiv Detail & Related papers (2025-01-31T04:14:05Z)
- CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models [13.799517170191919]
Recent research has shown that safety checkers are vulnerable to adversarial attacks, which allows attackers to bypass them and generate Not Safe For Work (NSFW) images.
We propose CROPS, a model-agnostic framework that easily defends against adversarial attacks generating NSFW images without requiring additional training.
arXiv Detail & Related papers (2025-01-09T16:43:21Z)
- Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction [49.60774626839712]
Training multimodal generative models can expose users to harmful, unsafe, controversial, or culturally inappropriate outputs.
We propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process to generate safer images.
We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety.
arXiv Detail & Related papers (2024-11-21T09:47:13Z)
- SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation [65.30207993362595]
Unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges.
We propose SAFREE, a training-free approach for safe T2I and T2V.
We detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace.
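As a rough sketch of the subspace idea (not SAFREE's exact training-free procedure), one can build an orthonormal basis from embeddings of toxic concept prompts and subtract a prompt embedding's projection onto that subspace; the dimensions and data below are placeholders.

```python
# Sketch: estimate a "toxic" subspace from concept embeddings and steer a
# prompt embedding away from it by removing its projection onto that subspace.
import numpy as np

def toxic_subspace_basis(toxic_embs, k=None):
    """toxic_embs: (n_concepts, d). Returns an orthonormal basis (d, k)."""
    # SVD of the centered concept matrix; right singular vectors span the subspace.
    u, s, vt = np.linalg.svd(toxic_embs - toxic_embs.mean(0), full_matrices=False)
    k = k or len(s)
    return vt[:k].T                                   # (d, k)

def steer_away(prompt_emb, basis, strength=1.0):
    """Subtract the prompt embedding's component lying in the toxic subspace."""
    proj = basis @ (basis.T @ prompt_emb)
    return prompt_emb - strength * proj

# Toy usage with random vectors standing in for text-encoder embeddings.
rng = np.random.default_rng(0)
basis = toxic_subspace_basis(rng.normal(size=(10, 64)), k=5)
steered = steer_away(rng.normal(size=64), basis)
print(np.linalg.norm(basis.T @ steered))              # ~0 with strength=1
```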
arXiv Detail & Related papers (2024-10-16T17:32:23Z)
- ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning [7.099258248662009]
There is a potential risk that a text-to-image (T2I) model can generate unsafe images with disturbing content.
In this work, we focus on eliminating NSFW (not safe for work) content generation from the T2I model.
We propose a customized reward function consisting of a CLIP (Contrastive Language-Image Pre-training) reward and a nudity reward to suppress nudity content.
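A hedged sketch of what a reward of this shape could look like; `clip_similarity` and `nudity_score` are hypothetical callables standing in for a CLIP model and a nudity classifier, and the weighting is illustrative rather than ShieldDiff's actual reward.

```python
# Sketch of a combined reward: reward prompt-image alignment, penalize nudity.
def shield_style_reward(image, prompt, clip_similarity, nudity_score, alpha=1.0):
    """High when the image matches the prompt AND contains no explicit content."""
    alignment = clip_similarity(image, prompt)   # e.g., in [-1, 1], higher = better match
    nudity = nudity_score(image)                 # e.g., in [0, 1], higher = more explicit
    return alignment - alpha * nudity

# Toy usage with dummy scorers; in RL fine-tuning this reward would score
# images sampled from the diffusion model for each training prompt.
print(shield_style_reward(None, "a cat", lambda img, p: 0.8, lambda img: 0.1))
```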
arXiv Detail & Related papers (2024-10-04T19:37:56Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
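A simplified NumPy illustration of closed-form projection editing in this spirit (a ridge-regularized least-squares update, not RECE's exact formulation): a linear cross-attention projection is modified so harmful concept embeddings map to the outputs the original weights produced for harmless anchors.

```python
# Closed form of:  min_W ||W @ E_harm - W_old @ E_safe||^2 + lam * ||W - W_old||^2
import numpy as np

def edit_projection(W_old, E_harm, E_safe, lam=0.5):
    """W_old: (out, d); E_harm, E_safe: (d, n) column-stacked concept embeddings."""
    d = W_old.shape[1]
    target = W_old @ E_safe                        # where harmful inputs should land
    A = E_harm @ E_harm.T + lam * np.eye(d)
    B = target @ E_harm.T + lam * W_old
    return B @ np.linalg.inv(A)

# Toy usage: the edited W maps harmful embeddings much closer to the harmless
# targets than the original W did.
rng = np.random.default_rng(0)
W_old = rng.normal(size=(32, 16))
E_harm, E_safe = rng.normal(size=(16, 4)), rng.normal(size=(16, 4))
W_new = edit_projection(W_old, E_harm, E_safe)
print(np.linalg.norm(W_new @ E_harm - W_old @ E_safe))
```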
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts.
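A minimal sketch of such a latent-space check, assuming a small learned projection head over text-encoder embeddings; the head, threshold, and shapes are placeholders rather than the trained Latent Guard model.

```python
# Sketch: map prompt and blacklisted-concept embeddings into a shared latent
# space and flag the prompt if any concept is too similar to it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptChecker(nn.Module):
    def __init__(self, dim=64, latent=32, threshold=0.7):
        super().__init__()
        self.head = nn.Linear(dim, latent)   # in practice trained on (prompt, concept) pairs
        self.threshold = threshold

    def forward(self, prompt_emb, concept_embs):
        """prompt_emb: (d,); concept_embs: (n_concepts, d). Returns (is_unsafe, scores)."""
        z_p = F.normalize(self.head(prompt_emb), dim=-1)
        z_c = F.normalize(self.head(concept_embs), dim=-1)
        scores = z_c @ z_p                   # cosine similarity per blacklisted concept
        return bool((scores > self.threshold).any()), scores

# Toy usage with random vectors standing in for text-encoder embeddings.
checker = ConceptChecker()
unsafe, scores = checker(torch.randn(64), torch.randn(20, 64))
print(unsafe, scores.max().item())
```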
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
- SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content.
We present SafeGen, a framework to mitigate sexual content generation by text-to-image models.
arXiv Detail & Related papers (2024-04-10T00:26:08Z)
- Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts [63.61248884015162]
Text-to-image diffusion models have shown remarkable ability in high-quality content generation.
This work proposes Prompting4Debugging (P4D) as a tool that automatically finds problematic prompts for diffusion models.
Our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered "safe" can actually be manipulated to bypass many deployed safety mechanisms.
arXiv Detail & Related papers (2023-09-12T11:19:36Z)
- Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models [79.50701155336198]
Forget-Me-Not is designed to safely remove specified IDs, objects, or styles from a well-configured text-to-image model in as little as 30 seconds.
We demonstrate that Forget-Me-Not can effectively eliminate targeted concepts while maintaining the model's performance on other concepts.
It can also be adapted as a lightweight model patch for Stable Diffusion, allowing for concept manipulation and convenient distribution.
arXiv Detail & Related papers (2023-03-30T17:58:11Z)