LEACE: Perfect linear concept erasure in closed form
- URL: http://arxiv.org/abs/2306.03819v3
- Date: Sun, 29 Oct 2023 21:41:46 GMT
- Title: LEACE: Perfect linear concept erasure in closed form
- Authors: Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell,
Edward Raff, Stella Biderman
- Score: 103.61624393221447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Concept erasure aims to remove specified features from a representation. It
can improve fairness (e.g. preventing a classifier from using gender or race)
and interpretability (e.g. removing a concept to observe changes in model
behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form
method which provably prevents all linear classifiers from detecting a concept
while changing the representation as little as possible, as measured by a broad
class of norms. We apply LEACE to large language models with a novel procedure
called "concept scrubbing," which erases target concept information from every
layer in the network. We demonstrate our method on two tasks: measuring the
reliance of language models on part-of-speech information, and reducing gender
bias in BERT embeddings. Code is available at
https://github.com/EleutherAI/concept-erasure.
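The guarantee stated in the abstract — no linear classifier can detect the concept after erasure — is equivalent to the erased representation having zero cross-covariance with the concept labels. The closed form can be sketched in a few lines of NumPy. This is an illustrative reimplementation of that guarantee, not the authors' reference code (see the linked repository for that); the function and variable names here are my own.

```python
import numpy as np

def fit_leace(X, Z):
    """Fit a LEACE-style eraser (sketch of the paper's closed form).

    X: (n, d) representations; Z: (n, k) concept labels (e.g. one-hot).
    Returns a function mapping x -> erased x with Cov(erased_x, Z) = 0.
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    mu_x = X.mean(axis=0)
    Xc = X - mu_x
    Zc = Z - Z.mean(axis=0)
    n = X.shape[0]
    sigma_xx = Xc.T @ Xc / n           # (d, d) covariance of X
    sigma_xz = Xc.T @ Zc / n           # (d, k) cross-covariance with Z

    # Whitening transform W = Sigma_xx^{-1/2} and its pseudo-inverse,
    # built from an eigendecomposition to tolerate rank deficiency.
    eigval, eigvec = np.linalg.eigh(sigma_xx)
    keep = eigval > 1e-10
    V = eigvec[:, keep]
    W = V @ np.diag(eigval[keep] ** -0.5) @ V.T
    W_pinv = V @ np.diag(eigval[keep] ** 0.5) @ V.T

    # Orthogonal projector P onto the column space of W @ Sigma_xz.
    M = W @ sigma_xz
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    U = U[:, s > 1e-10]
    P = U @ U.T

    # Erasure map r(x) = x - W^+ P W (x - mu_x), applied to row vectors.
    A = W_pinv @ P @ W
    def erase(x):
        return x - (x - mu_x) @ A.T
    return erase
```

Because `P` projects onto the column space of `W @ sigma_xz`, the erased data has exactly zero cross-covariance with `Z`, which is what rules out every linear detector; the paper additionally proves this particular projection perturbs the representation least in a broad class of norms, a property the sketch above does not check.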
Related papers
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery [52.498055901649025]
Concept Bottleneck Models (CBMs) have been proposed to address the 'black-box' problem of deep neural networks.
We propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm.
Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model.
arXiv Detail & Related papers (2024-07-19T17:50:11Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models [76.39651111467832]
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without requiring additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers.
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
- Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency [2.7719338074999547]
Concept bottleneck models (CBMs) have emerged as critical tools in domains where interpretability is paramount.
This study proposes Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency.
arXiv Detail & Related papers (2024-06-13T06:04:34Z)
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually grounded text perturbation methods such as typos and word-order shuffling, which resonate with human cognitive patterns and allow perturbations to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- Concept Distillation: Leveraging Human-Centered Explanations for Model Improvement [3.026365073195727]
Concept Activation Vectors (CAVs) estimate a model's sensitivity and possible biases to a given concept.
We extend CAVs from post-hoc analysis to ante-hoc training in order to reduce model bias through fine-tuning.
We show applications of concept-sensitive training to debias several classification problems.
arXiv Detail & Related papers (2023-11-26T14:00:14Z)
- Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present Geom-Erasing, a novel concept removal method based on geometry-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z)
- A Geometric Notion of Causal Probing [91.14470073637236]
In a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace.
We give a set of intrinsic criteria which characterize an ideal linear concept subspace.
We find that LEACE returns a one-dimensional subspace containing roughly half of total concept information.
arXiv Detail & Related papers (2023-07-27T17:57:57Z)
- Interpreting Embedding Spaces by Conceptualization [2.620130580437745]
We present a novel method of understanding embeddings by transforming a latent embedding space into a comprehensible conceptual space.
We devise a new evaluation method, using either human raters or LLM-based raters, to show that the vectors indeed represent the semantics of the original latent ones.
arXiv Detail & Related papers (2022-08-22T15:32:17Z)
- Probing Classifiers are Unreliable for Concept Removal and Detection [18.25734277357466]
Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation.
Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation.
We show that these methods can be counter-productive, and in the worst case may end up destroying all task-relevant features.
arXiv Detail & Related papers (2022-07-08T23:15:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.