Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images
- URL: http://arxiv.org/abs/2502.05066v2
- Date: Mon, 10 Feb 2025 14:58:43 GMT
- Title: Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images
- Authors: Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch
- Abstract summary: We identify a novel threat: the generation of NSFW text embedded within images.
This includes offensive language, such as insults, racial slurs, and sexually explicit terms.
Existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation.
To advance research in this area, we introduce ToxicBench, an open-source benchmark for evaluating NSFW text generation in images.
- Score: 5.150015329535525
- Abstract: State-of-the-art visual generation models, such as Diffusion Models (DMs) and Vision Auto-Regressive Models (VARs), produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, Flux, DeepFloyd IF) and VARs (e.g., Infinity) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we explore safety fine-tuning of the text encoder underlying major DM architectures using a customized dataset. Thereby, we suppress NSFW generation while preserving overall image and text generation quality. Finally, to advance research in this area, we introduce ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. ToxicBench provides a curated dataset of harmful prompts, new metrics, and an evaluation pipeline assessing both NSFW-ness and generation quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models.
Related papers
- CROPS: Model-Agnostic Training-Free Framework for Safe Image Synthesis with Latent Diffusion Models [13.799517170191919]
Recent research has shown that safety checkers have vulnerabilities against adversarial attacks, which allow Not Safe For Work (NSFW) images to be generated.
We propose CROPS, a model-agnostic framework that easily defends against adversarial attacks generating NSFW images without requiring additional training.
arXiv Detail & Related papers (2025-01-09T16:43:21Z)
- Buster: Implanting Semantic Backdoor into Text Encoder to Mitigate NSFW Content Generation [15.703408347981776]
We propose an innovative framework named Buster, which injects backdoors into the text encoder to prevent NSFW content generation.
Buster leverages deep semantic information rather than explicit prompts as triggers, redirecting NSFW prompts towards targeted benign prompts.
Our experiments show that Buster outperforms nine state-of-the-art baselines, achieving a superior NSFW content removal rate of at least 91.2%.
arXiv Detail & Related papers (2024-12-10T07:18:51Z)
- Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding [13.481343482138888]
We propose a vision-agnostic safe generation framework, Embedding Sanitizer (ES).
ES focuses on erasing inappropriate concepts from prompt embeddings and uses the sanitized embeddings to guide the model for safe generation.
ES significantly outperforms existing safeguards in terms of interpretability and controllability while maintaining generation quality (a minimal sketch of the embedding-sanitization idea appears after this list).
arXiv Detail & Related papers (2024-11-15T16:29:02Z)
- KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities [93.74881034001312]
We conduct a systematic study on the fidelity of entities in text-to-image generation models.
We focus on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details.
arXiv Detail & Related papers (2024-10-15T17:50:37Z)
- TrojVLM: Backdoor Attack Against Vision Language Models [50.87239635292717]
This study introduces TrojVLM, the first exploration of backdoor attacks aimed at Vision Language Models (VLMs).
TrojVLM inserts predetermined target text into output text when encountering poisoned images.
A novel semantic preserving loss is proposed to ensure the semantic integrity of the original image content.
arXiv Detail & Related papers (2024-09-28T04:37:09Z)
- Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are based either on text blacklists, which can be easily circumvented, or on harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where the presence of harmful concepts can be checked.
arXiv Detail & Related papers (2024-04-11T17:59:52Z)
- SafeGen: Mitigating Sexually Explicit Content Generation in Text-to-Image Models [28.23494821842336]
Text-to-image models may be tricked into generating not-safe-for-work (NSFW) content.
We present SafeGen, a framework to mitigate sexual content generation by text-to-image models.
arXiv Detail & Related papers (2024-04-10T00:26:08Z)
- Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models [42.19184265811366]
We introduce a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs.
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences.
arXiv Detail & Related papers (2023-11-27T19:02:17Z)
- BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models [54.19289900203071]
The rise in popularity of text-to-image generative artificial intelligence has attracted widespread public interest.
We demonstrate that this technology can be attacked to generate content that subtly manipulates its users.
We propose a Backdoor Attack on text-to-image Generative Models (BAGM).
Our attack is the first to target three popular text-to-image generative models across three stages of the generative process.
arXiv Detail & Related papers (2023-07-31T08:34:24Z)
- Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation [65.48908724440047]
We propose a method called reverse generation to construct adversarial contexts conditioned on a given response.
We test three popular pretrained dialogue models (Blender, DialoGPT, and Plato2) and find that the resulting BAD+ dataset can largely expose their safety problems.
arXiv Detail & Related papers (2022-12-04T12:23:41Z)
- Visualize Before You Write: Imagination-Guided Open-Ended Text Generation [68.96699389728964]
We propose iNLG, which uses machine-generated images to guide language models in open-ended text generation.
Experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks.
arXiv Detail & Related papers (2022-10-07T18:01:09Z)