Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction
- URL: http://arxiv.org/abs/2411.13982v1
- Date: Thu, 21 Nov 2024 09:47:13 GMT
- Title: Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction
- Authors: Jordan Vice, Naveed Akhtar, Richard Hartley, Ajmal Mian,
- Abstract summary: Training multimodal generative models can expose users to harmful, unsafe and controversial or culturally-inappropriate outputs.
We propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process to generate safer images.
We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety.
- Score: 49.60774626839712
- License:
- Abstract: Training multimodal generative models on large, uncurated datasets can result in users being exposed to harmful, unsafe and controversial or culturally-inappropriate outputs. While model editing has been proposed to remove or filter undesirable concepts in embedding and latent spaces, it can inadvertently damage learned manifolds, distorting concepts in close semantic proximity. We identify limitations in current model editing techniques, showing that even benign, proximal concepts may become misaligned. To address the need for safe content generation, we propose a modular, dynamic solution that leverages safety-context embeddings and a dual reconstruction process using tunable weighted summation in the latent space to generate safer images. Our method preserves global context without compromising the structural integrity of the learned manifolds. We achieve state-of-the-art results on safe image generation benchmarks, while offering controllable variation of model safety. We identify trade-offs between safety and censorship, which presents a necessary perspective in the development of ethical AI models. We will release our code. Keywords: Text-to-Image Models, Generative AI, Safety, Reliability, Model Editing
Related papers
- ShieldDiff: Suppressing Sexual Content Generation from Diffusion Models through Reinforcement Learning [7.099258248662009]
There is a potential risk that text-to-image (T2I) model can generate unsafe images with uncomfortable contents.
In our work, we focus on eliminating the NSFW (not safe for work) content generation from T2I model.
We propose a customized reward function consisting of the CLIP (Contrastive Language-Image Pre-training) and nudity rewards to prune the nudity contents.
arXiv Detail & Related papers (2024-10-04T19:37:56Z) - EIUP: A Training-Free Approach to Erase Non-Compliant Concepts Conditioned on Implicit Unsafe Prompts [32.590822043053734]
Non-toxic text still carries a risk of generating non-compliant images, which is referred to as implicit unsafe prompts.
We propose a simple yet effective approach that incorporates non-compliant concepts into an erasure prompt.
Our method exhibits superior erasure effectiveness while achieving high scores in image fidelity.
arXiv Detail & Related papers (2024-08-02T05:17:14Z) - Direct Unlearning Optimization for Robust and Safe Text-to-Image Models [29.866192834825572]
Unlearning techniques have been developed to remove the model's ability to generate potentially harmful content.
These methods are easily bypassed by adversarial attacks, making them unreliable for ensuring the safety of generated images.
We propose Direct Unlearning Optimization (DUO), a novel framework for removing Not Safe For Work (NSFW) content from T2I models.
arXiv Detail & Related papers (2024-07-17T08:19:11Z) - Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z) - Latent Guard: a Safety Framework for Text-to-image Generation [64.49596711025993]
Existing safety measures are either based on text blacklists, which can be easily circumvented, or harmful content classification.
We propose Latent Guard, a framework designed to improve safety measures in text-to-image generation.
Inspired by blacklist-based approaches, Latent Guard learns a latent space on top of the T2I model's text encoder, where it is possible to check the presence of harmful concepts.
arXiv Detail & Related papers (2024-04-11T17:59:52Z) - Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models [42.19184265811366]
We introduce a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs.
We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences.
arXiv Detail & Related papers (2023-11-27T19:02:17Z) - Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z) - Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion
Models [63.20512617502273]
We propose a method called SDD to prevent problematic content generation in text-to-image diffusion models.
Our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality.
arXiv Detail & Related papers (2023-07-12T07:48:29Z) - Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models [79.50701155336198]
textbfForget-Me-Not is designed to safely remove specified IDs, objects, or styles from a well-configured text-to-image model in as little as 30 seconds.
We demonstrate that Forget-Me-Not can effectively eliminate targeted concepts while maintaining the model's performance on other concepts.
It can also be adapted as a lightweight model patch for Stable Diffusion, allowing for concept manipulation and convenient distribution.
arXiv Detail & Related papers (2023-03-30T17:58:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.