SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
- URL: http://arxiv.org/abs/2501.18052v2
- Date: Fri, 31 Jan 2025 18:39:23 GMT
- Title: SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
- Authors: Bartosz Cywiński, Kamil Deja
- Abstract summary: Diffusion models can inadvertently generate harmful or undesirable content.
Recent machine unlearning approaches offer potential solutions but often lack transparency.
We introduce SAeUron, a novel method leveraging features learned by sparse autoencoders.
- Score: 4.013156524547073
- Abstract: Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation with the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content, even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.
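The intervention the abstract describes — encoding diffusion-model activations with a trained SAE, suppressing the features selected for the targeted concept, and decoding the result back into the activation stream — can be pictured with the minimal sketch below. It is an assumed illustration only: the module, layer sizes, and feature indices are invented for the example, and the actual implementation lives in the linked repository.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: linear encoder with a ReLU bottleneck and a linear decoder."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

@torch.no_grad()
def ablate_concept(sae: SparseAutoencoder,
                   activations: torch.Tensor,
                   concept_feature_idx: list,
                   scale: float = 0.0) -> torch.Tensor:
    """Encode activations, dampen the selected concept features, decode back.

    `concept_feature_idx` stands in for the output of a feature-selection step
    that scores how strongly each SAE feature fires on the unwanted concept.
    """
    features = sae.encode(activations)          # (batch, n_features), sparse codes
    features[:, concept_feature_idx] *= scale   # zero out (or rescale) the concept features
    return sae.decode(features)                 # concept-suppressed activations

# Hypothetical usage at one denoising timestep: swap the raw activations for
# their concept-ablated reconstruction before the next block of the U-Net.
sae = SparseAutoencoder(d_model=1280, n_features=16384)
acts = torch.randn(4, 1280)                     # stand-in for one timestep's activations
edited = ablate_concept(sae, acts, concept_feature_idx=[421, 987])
```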
Related papers
- Concept Steerers: Leveraging K-Sparse Autoencoders for Controllable Generations [10.86252546314626]
Text-to-image generative models are prone to adversarial attacks and inadvertently generate unsafe, unethical content.
We propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation.
Our method yields an improvement of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim}5\times$ faster than the current state-of-the-art.
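A k-sparse autoencoder differs from a standard SAE mainly in its bottleneck: only the k largest pre-activations are kept per example, which is what makes individual latent features easy to attribute to concepts and to steer. A minimal, assumed formulation (not the paper's code) is:

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """k-sparse autoencoder: keep only the k largest latent activations per sample."""
    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)
        self.k = k

    def forward(self, x: torch.Tensor):
        pre = self.enc(x)
        topk = torch.topk(pre, self.k, dim=-1)
        # All latents are zero except the k largest per sample (rectified).
        latents = torch.zeros_like(pre).scatter_(-1, topk.indices, torch.relu(topk.values))
        return self.dec(latents), latents

recon, latents = TopKSAE(d_model=768, n_features=8192, k=32)(torch.randn(4, 768))
```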
arXiv Detail & Related papers (2025-01-31T11:52:47Z) - ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires the full-sequence diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z) - Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models [57.16056181201623]
Fine-tuning text-to-image diffusion models can inadvertently undo safety measures, causing models to relearn harmful concepts.
We present a novel but immediate solution called Modular LoRA, which involves training Safety Low-Rank Adaptation modules separately from Fine-Tuning LoRA components.
This method effectively prevents the re-learning of harmful content without compromising the model's performance on new tasks.
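Keeping the safety adapter separate from the task adapter amounts to composing two independently trained low-rank updates on the same frozen base weights, so that later fine-tuning cannot overwrite the safety update. The sketch below shows this generic composition; the class and parameter names are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLoRALinear(nn.Module):
    """Frozen base linear layer plus two independently trained low-rank adapters."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base.requires_grad_(False)
        out_f, in_f = base.weight.shape
        # Safety adapter (trained once, then frozen) and task adapter (fine-tuned later).
        self.safety_A = nn.Parameter(torch.zeros(rank, in_f))
        self.safety_B = nn.Parameter(torch.zeros(out_f, rank))
        self.task_A = nn.Parameter(torch.zeros(rank, in_f))
        self.task_B = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Both low-rank updates are added to the frozen base weight at inference.
        delta = self.safety_B @ self.safety_A + self.task_B @ self.task_A
        return self.base(x) + F.linear(x, delta)

layer = DualLoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(2, 768))
```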
arXiv Detail & Related papers (2024-11-30T04:37:38Z) - Erasing Undesirable Concepts in Diffusion Models with Adversarial Preservation [22.3077678575067]
Diffusion models excel at generating visually striking content from text but can inadvertently produce undesirable or harmful content when trained on unfiltered internet data.
We propose to identify and preserve the concepts most affected by parameter changes, termed adversarial concepts.
We demonstrate the effectiveness of our method using the Stable Diffusion model, showing that it outperforms state-of-the-art erasure methods in eliminating unwanted content.
arXiv Detail & Related papers (2024-10-21T03:40:29Z) - SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation [65.30207993362595]
Unlearning/editing-based methods for safe generation remove harmful concepts from models but face several challenges.
We propose SAFREE, a training-free approach for safe T2I and T2V.
We detect a subspace corresponding to a set of toxic concepts in the text embedding space and steer prompt embeddings away from this subspace.
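Steering prompt embeddings away from a detected toxic subspace corresponds to a standard orthogonal-projection step: estimate a basis from embeddings of toxic concept phrases and remove each token embedding's component inside that span. The function below is an assumed illustration of that operation, not SAFREE's implementation.

```python
import torch

def remove_toxic_subspace(prompt_emb: torch.Tensor,
                          toxic_embs: torch.Tensor,
                          strength: float = 1.0) -> torch.Tensor:
    """Project prompt embeddings away from the span of toxic concept embeddings.

    prompt_emb: (seq_len, d) text-encoder embeddings for the prompt.
    toxic_embs: (n_concepts, d) embeddings of toxic concept phrases.
    """
    basis, _ = torch.linalg.qr(toxic_embs.T)          # orthonormal basis of the toxic span, (d, n_concepts)
    toxic_component = (prompt_emb @ basis) @ basis.T  # component of each token inside that span
    return prompt_emb - strength * toxic_component

safe_emb = remove_toxic_subspace(torch.randn(77, 768), torch.randn(8, 768))
```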
arXiv Detail & Related papers (2024-10-16T17:32:23Z) - Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been regarded as a challenging property to encode in neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z) - Probing Unlearned Diffusion Models: A Transferable Adversarial Attack Perspective [20.263233740360022]
Unlearning methods have been developed to erase concepts from diffusion models.
This paper leverages the transferability of adversarial attacks to probe unlearning robustness in a black-box setting.
Specifically, we employ an adversarial search strategy to find an adversarial embedding that transfers across different unlearned models.
arXiv Detail & Related papers (2024-04-30T09:14:54Z) - Hiding and Recovering Knowledge in Text-to-Image Diffusion Models via Learnable Prompts [23.04942433104886]
We introduce a novel concept-hiding approach that makes unwanted concepts inaccessible to public users.
Instead of erasing knowledge from the model entirely, we incorporate a learnable prompt into the cross-attention module.
This enables flexible access control -- ensuring that undesirable content cannot be easily generated while preserving the option to reinstate it.
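A learnable prompt in the cross-attention module can be pictured as a handful of trained embedding tokens concatenated to the text conditioning: with the tokens present the hidden concept is not rendered, and removing them reinstates it. The sketch below is an assumed illustration; names and shapes are invented for the example.

```python
import torch
import torch.nn as nn

class LearnablePromptConditioner(nn.Module):
    """Prepend trained 'hiding' tokens to the text embeddings used by cross-attention."""
    def __init__(self, n_tokens: int = 8, d_text: int = 768):
        super().__init__()
        self.hiding_tokens = nn.Parameter(torch.randn(n_tokens, d_text) * 0.02)

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, seq_len, d_text) from the frozen text encoder.
        batch = text_emb.shape[0]
        hiding = self.hiding_tokens.unsqueeze(0).expand(batch, -1, -1)
        # Cross-attention keys/values now include the learned tokens, which are
        # optimized so the hidden concept is not generated; dropping them restores it.
        return torch.cat([hiding, text_emb], dim=1)

cond = LearnablePromptConditioner()(torch.randn(2, 77, 768))
```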
arXiv Detail & Related papers (2024-03-18T23:42:04Z) - Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts that lead diffusion models to generate inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts originally regarded as safe into prompts that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z) - Towards Safe Self-Distillation of Internet-Scale Text-to-Image Diffusion Models [63.20512617502273]
We propose a method called SDD to prevent problematic content generation in text-to-image diffusion models.
Our method eliminates a much greater proportion of harmful content from the generated images without degrading the overall image quality.
arXiv Detail & Related papers (2023-07-12T07:48:29Z)