Related papers: SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders

URL: http://arxiv.org/abs/2509.21379v1
Date: Tue, 23 Sep 2025 11:29:30 GMT
Title: SAEmnesia: Erasing Concepts in Diffusion Models with Sparse Autoencoders
Authors: Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto
Abstract summary: SAEmnesia is a supervised sparse autoencoder training method that promotes one-to-one concept-neuron mappings. Our approach learns specialized neurons with significantly stronger concept associations compared to unsupervised baselines.
Score: 6.6477077425454745
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Effective concept unlearning in text-to-image diffusion models requires precise localization of concept representations within the model's latent space. While sparse autoencoders successfully reduce neuron polysemanticity (i.e., multiple concepts per neuron) compared to the original network, individual concept representations can still be distributed across multiple latent features, requiring extensive search procedures for concept unlearning. We introduce SAEmnesia, a supervised sparse autoencoder training method that promotes one-to-one concept-neuron mappings through systematic concept labeling, mitigating feature splitting and promoting feature centralization. Our approach learns specialized neurons with significantly stronger concept associations compared to unsupervised baselines. The only computational overhead introduced by SAEmnesia is limited to cross-entropy computation during training. At inference time, this interpretable representation reduces hyperparameter search by 96.67% with respect to current approaches. On the UnlearnCanvas benchmark, SAEmnesia achieves a 9.22% improvement over the state-of-the-art. In sequential unlearning tasks, we demonstrate superior scalability with a 28.4% improvement in unlearning accuracy for 9-object removal.
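
The abstract states that training adds only a cross-entropy computation on top of a sparse autoencoder. As a rough illustration of that idea, and not the authors' implementation, the sketch below combines a reconstruction term, an L1 sparsity term, and a cross-entropy term that treats the first `n_concepts` latent activations as class logits, so that latent `i` is pushed to fire for concept `i`. The function names, the loss weights `l1_coef` and `ce_coef`, and the choice of which latents act as logits are all assumptions made for this example.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def supervised_sae_loss(x, label, W_enc, W_dec, n_concepts,
                        l1_coef=1e-3, ce_coef=1.0):
    """Hypothetical supervised SAE loss (illustrative assumption only).

    x      : input activation vector
    label  : index of the concept present in x (0 <= label < n_concepts)
    W_enc  : encoder weights, shape (n_latents, len(x))
    W_dec  : decoder weights, shape (len(x), n_latents)
    """
    # Encode with a ReLU, then decode linearly.
    z = [max(0.0, v) for v in matvec(W_enc, x)]
    x_hat = matvec(W_dec, z)

    # Standard SAE terms: mean squared reconstruction error and L1 sparsity.
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    l1 = sum(abs(v) for v in z)

    # Supervised term (assumed form): cross-entropy with the first
    # n_concepts latents used as logits, so latent `label` should dominate.
    ce = -log_softmax(z[:n_concepts])[label]

    total = recon + l1_coef * l1 + ce_coef * ce
    return {"total": total, "recon": recon, "l1": l1, "ce": ce}

# Toy usage: identity encoder/decoder, input that activates latent 0 only.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [2.0, 0.0, 0.0]
loss_right = supervised_sae_loss(x, 0, identity, identity, n_concepts=3)
loss_wrong = supervised_sae_loss(x, 1, identity, identity, n_concepts=3)
print(loss_right["total"] < loss_wrong["total"])
```

In this toy setup the reconstruction and sparsity terms are identical for both labels, so the difference in total loss comes entirely from the cross-entropy term, which is lower when the active latent matches the labeled concept.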

Related papers

Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking [29.62352462254763]
Forget It All (FIA) is a framework for selectively erasing unwanted concepts from a pre-trained model. FIA achieves more reliable multi-concept unlearning, improving effectiveness while maintaining semantic fidelity and image quality.
arXiv Detail & Related papers (2026-01-07T00:13:36Z)
Sparse Autoencoder Neural Operators: Model Recovery in Function Spaces [75.45093712182624]
We introduce a framework that extends sparse autoencoders (SAEs) to lifted spaces and infinite-dimensional function spaces, enabling mechanistic interpretability of large neural operators (NO). We compare the inference and training dynamics of SAEs, lifted-SAE, and SAE neural operators. We highlight how lifting and operator modules introduce beneficial inductive biases, enabling faster recovery, improved recovery of smooth concepts, and robust inference across varying resolutions, a property unique to neural operators.
arXiv Detail & Related papers (2025-09-03T21:57:03Z)
Evaluating Sparse Autoencoders for Monosemantic Representation [10.22895453657019]
A key barrier to interpreting large language models is polysemanticity. We show that SAEs reduce polysemanticity and achieve higher concept separability.
arXiv Detail & Related papers (2025-08-20T22:08:01Z)
Efficient Machine Unlearning via Influence Approximation [75.31015485113993]
Influence-based unlearning has emerged as a prominent approach to estimate the impact of individual training samples on model parameters without retraining. This paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning). We introduce the Influence Approximation Unlearning algorithm for efficient machine unlearning from the incremental perspective.
arXiv Detail & Related papers (2025-07-31T05:34:27Z)
Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable. We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z)
Concept-Guided Interpretability via Neural Chunking [64.6429903327095]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We propose three methods to extract recurring chunks on a neural population level. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z)
SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders [4.013156524547073]
Diffusion models can inadvertently generate harmful or undesirable content. Recent machine unlearning approaches offer potential solutions but often lack transparency. We introduce SAeUron, a novel method leveraging features learned by sparse autoencoders to remove unwanted concepts.
arXiv Detail & Related papers (2025-01-29T23:29:47Z)
Growing Deep Neural Network Considering with Similarity between Neurons [4.32776344138537]
We explore a novel approach of progressively increasing neuron numbers in compact models during training phases. We propose a method that reduces feature extraction biases and neuronal redundancy by introducing constraints based on neuron similarity distributions. Results on CIFAR-10 and CIFAR-100 datasets demonstrated accuracy improvement.
arXiv Detail & Related papers (2024-08-23T11:16:37Z)
Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation [0.0]
Out-of-distribution generalization in neural networks is often hampered by spurious correlations. Existing concept-removal methods tend to be overzealous by inadvertently eliminating features associated with the main task of the model. We propose an iterative algorithm that separates spurious from main-task concepts by jointly identifying two low-dimensional subspaces in the neural network representation.
arXiv Detail & Related papers (2023-10-18T14:22:36Z)
Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation [49.44309457870649]
Layer-wise Feedback Propagation (LFP) is a novel training principle for neural network-like predictors. LFP decomposes a reward to individual neurons based on their respective contributions. Our method then implements a greedy approach, reinforcing helpful parts of the network and weakening harmful ones.
arXiv Detail & Related papers (2023-08-23T10:48:28Z)
A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers. Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module. Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z)
Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State [66.2457134675891]
Spiking neural networks (SNNs) are brain-inspired models that enable energy-efficient implementation on neuromorphic hardware. Most existing methods imitate the backpropagation framework and feedforward architectures for artificial neural networks. We propose a novel training method that does not rely on the exact reverse of the forward computation.
arXiv Detail & Related papers (2021-09-29T07:46:54Z)
Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks. We explore the diversity of the neurons within the hidden layer during the learning process. We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.