Probing Classifiers are Unreliable for Concept Removal and Detection
- URL: http://arxiv.org/abs/2207.04153v3
- Date: Mon, 19 Jun 2023 17:37:02 GMT
- Title: Probing Classifiers are Unreliable for Concept Removal and Detection
- Authors: Abhinav Kumar, Chenhao Tan, Amit Sharma
- Abstract summary: Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation.
Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation.
We show that these methods can be counter-productive, and in the worst case may end up destroying all task-relevant features.
- Score: 18.25734277357466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural network models trained on text data have been found to encode
undesirable linguistic or sensitive concepts in their representation. Removing
such concepts is non-trivial because of a complex relationship between the
concept, text input, and the learnt representation. Recent work has proposed
post-hoc and adversarial methods to remove such unwanted concepts from a
model's representation. Through an extensive theoretical and empirical
analysis, we show that these methods can be counter-productive: they are unable
to remove the concepts entirely, and in the worst case may end up destroying
all task-relevant features. The reason is the methods' reliance on a probing
classifier as a proxy for the concept. Even under the most favorable
conditions for learning a probing classifier, when a concept's relevant
features in representation space alone can provide 100% accuracy, we prove
that a probing classifier is likely to use non-concept features, and thus
post-hoc or adversarial methods will fail to remove the concept correctly. These
theoretical implications are confirmed by experiments on models trained on
synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of
concept removal such as fairness, we recommend caution against using these
methods and propose a spuriousness metric to gauge the quality of the final
classifier.
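To make the failure mode in the abstract concrete, the following is a minimal synthetic sketch (not the authors' code, datasets, or spuriousness metric; the variable names and data-generating process are illustrative assumptions). Even though the concept features alone would let a probe reach roughly 100% accuracy, a linear probe trained on the full representation also places weight on correlated non-concept features, and projecting out the probe's direction does not erase the concept:

```python
# Minimal illustrative sketch (assumed setup, not the paper's experiments):
# a probe trained to predict a concept also uses correlated non-concept
# features, so projecting out its direction fails to remove the concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
concept = rng.integers(0, 2, size=n)  # binary concept label (e.g., a sensitive attribute)

# Concept-relevant features: separable on their own with near-100% accuracy.
concept_feats = (concept[:, None] * 2.0 - 1.0) + 0.1 * rng.normal(size=(n, 2))

# Non-concept (task) features: merely correlated with the concept (~80% agreement).
task_signal = np.where(rng.random(n) < 0.8, concept, 1 - concept)
task_feats = (task_signal[:, None] * 2.0 - 1.0) + 0.5 * rng.normal(size=(n, 3))

X = np.hstack([concept_feats, task_feats])  # stand-in for a learnt representation

probe = LogisticRegression(max_iter=1000).fit(X, concept)
w = probe.coef_.ravel()
print("fraction of probe weight on non-concept features:",
      np.abs(w[2:]).sum() / np.abs(w).sum())

# Post-hoc "removal": orthogonally project out the probe's weight direction.
u = w / np.linalg.norm(w)
X_removed = X - np.outer(X @ u, u)

# A fresh probe still detects the concept well above chance after removal.
probe2 = LogisticRegression(max_iter=1000).fit(X_removed, concept)
print("concept accuracy after removal:", probe2.score(X_removed, concept))
```

In this toy setup the post-hoc removal both perturbs the task-relevant features (they carry part of the removed direction) and leaves the concept linearly recoverable through them, which is the counter-productive behavior the paper analyzes.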
Related papers
- Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery [52.498055901649025]
Concept Bottleneck Models (CBMs) have been proposed to address the 'black-box' problem of deep neural networks.
We propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm.
Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model.
arXiv Detail & Related papers (2024-07-19T17:50:11Z)
- Explaining Explainability: Understanding Concept Activation Vectors [35.37586279472797]
Recent interpretability methods propose using concept-based explanations to translate internal representations of deep learning models into a language that humans are familiar with: concepts.
This requires understanding which concepts are present in the representation space of a neural network.
In this work, we investigate three properties of Concept Activation Vectors (CAVs), which are learnt using a probe dataset of concept exemplars.
We introduce tools designed to detect the presence of these properties, provide insight into how they affect the derived explanations, and provide recommendations to minimise their impact.
arXiv Detail & Related papers (2024-04-04T17:46:20Z)
- Concept Distillation: Leveraging Human-Centered Explanations for Model Improvement [3.026365073195727]
Concept Activation Vectors (CAVs) estimate a model's sensitivity and possible biases to a given concept.
We extend CAVs from post-hoc analysis to ante-hoc training in order to reduce model bias through fine-tuning.
We show applications of concept-sensitive training to debias several classification problems.
arXiv Detail & Related papers (2023-11-26T14:00:14Z)
- Meaning Representations from Trajectories in Autoregressive Models [106.63181745054571]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text.
This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model.
We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle.
arXiv Detail & Related papers (2023-10-23T04:35:58Z)
- Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present Geom-Erasing, a novel concept removal method based on geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z)
- A Recursive Bateson-Inspired Model for the Generation of Semantic Formal Concepts from Spatial Sensory Data [77.34726150561087]
This paper presents a new symbolic-only method for the generation of hierarchical concept structures from complex sensory data.
The approach is based on Bateson's notion of difference as the key to the genesis of an idea or a concept.
The model is able to produce fairly rich yet human-readable conceptual representations without training.
arXiv Detail & Related papers (2023-07-16T15:59:13Z)
- LEACE: Perfect linear concept erasure in closed form [103.61624393221447]
Concept erasure aims to remove specified features from a representation.
We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible.
We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. (A simplified linear-erasure sketch appears after this list.)
arXiv Detail & Related papers (2023-06-06T16:07:24Z)
- Statistically Significant Concept-based Explanation of Image Classifiers via Model Knockoffs [22.576922942465142]
Concept-based explanations can produce false positives, mistakenly regarding unrelated concepts as important for the prediction task.
We propose a method using a deep learning model to learn the image concept and then using the Knockoff samples to select the important concepts for prediction.
arXiv Detail & Related papers (2023-05-27T05:40:05Z)
- DISSECT: Disentangled Simultaneous Explanations via Concept Traversals [33.65478845353047]
DISSECT is a novel approach to explaining deep learning model inferences.
By training a generative model from a classifier's signal, DISSECT offers a way to discover a classifier's inherent "notion" of distinct concepts.
We show that DISSECT produces Concept Traversals (CTs) that disentangle several concepts and are coupled to the classifier's reasoning due to joint training.
arXiv Detail & Related papers (2021-05-31T17:11:56Z)
- Contrastive Explanations for Model Interpretability [77.92370750072831]
We propose a methodology to produce contrastive explanations for classification models.
Our method is based on projecting model representation to a latent space.
Our findings shed light on the ability of label-contrastive explanations to provide more accurate and finer-grained interpretability of a model's decisions.
arXiv Detail & Related papers (2021-03-02T00:36:45Z)
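For the LEACE entry above, here is a simplified sketch of linear concept erasure, included only to illustrate the general idea. It projects out a single least-squares concept direction; it does not reproduce LEACE's closed-form, covariance-aware projection, and the function name and data shapes are assumptions:

```python
# Illustrative sketch of linear concept erasure (not the LEACE algorithm itself):
# remove the single direction of the representation that best predicts a binary
# concept in the least-squares sense, via an orthogonal projection.
import numpy as np

def erase_linear_concept(X: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Project out the least-squares direction relating representation X to concept z."""
    Xc = X - X.mean(axis=0)
    zc = z - z.mean()
    w, *_ = np.linalg.lstsq(Xc, zc, rcond=None)  # least-squares concept direction
    u = w / np.linalg.norm(w)
    return X - np.outer(X @ u, u)  # orthogonal projection removing that direction

# Hypothetical usage: X holds (n_samples, d) activations, z is a {0, 1} concept label.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))
z = rng.integers(0, 2, size=200)
X_erased = erase_linear_concept(X, z)
```

The actual LEACE method derives a closed-form projection that provably prevents any linear classifier from recovering the concept while changing the representation as little as possible; this single-direction sketch carries no such guarantee.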