Is This the Subspace You Are Looking for? An Interpretability Illusion
for Subspace Activation Patching
- URL: http://arxiv.org/abs/2311.17030v2
- Date: Wed, 6 Dec 2023 14:28:46 GMT
- Title: Is This the Subspace You Are Looking for? An Interpretability Illusion
for Subspace Activation Patching
- Authors: Aleksandar Makelov, Georg Lange, Neel Nanda
- Abstract summary: Mechanistic interpretability aims to understand model behaviors in terms of specific, interpretable features.
Recent studies have explored subspace interventions as a way to manipulate model behavior and attribute the features behind it to given subspaces.
We demonstrate that these two aims diverge, potentially leading to an illusory sense of interpretability.
- Score: 47.05588106164043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mechanistic interpretability aims to understand model behaviors in terms of
specific, interpretable features, often hypothesized to manifest as
low-dimensional subspaces of activations. Specifically, recent studies have
explored subspace interventions (such as activation patching) as a way to
simultaneously manipulate model behavior and attribute the features behind it
to given subspaces.
In this work, we demonstrate that these two aims diverge, potentially leading
to an illusory sense of interpretability. Counterintuitively, even if a
subspace intervention makes the model's output behave as if the value of a
feature was changed, this effect may be achieved by activating a dormant
parallel pathway leveraging another subspace that is causally disconnected from
model outputs. We demonstrate this phenomenon in a distilled mathematical
example, in two real-world domains (the indirect object identification task and
factual recall), and present evidence for its prevalence in practice. In the
context of factual recall, we further show a link to rank-1 fact editing,
providing a mechanistic explanation for previous work observing an
inconsistency between fact editing performance and fact localization.
However, this does not imply that activation patching of subspaces is
intrinsically unfit for interpretability. To contextualize our findings, we
also show what a success case looks like in a task (indirect object
identification) where prior manual circuit analysis informs an understanding of
the location of a feature. We explore the additional evidence needed to argue
that a patched subspace is faithful.
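To make the object of study concrete, below is a minimal, hedged sketch of 1-dimensional subspace activation patching together with a toy construction in the spirit of the paper's distilled example (the exact construction in the paper may differ; the weights, directions, and function names here are illustrative assumptions). Patching along a "faithful" direction changes the output by changing the coordinate the model actually reads, while patching along an "illusory" direction that mixes a causally disconnected coordinate with a dormant one moves the output without touching the true feature coordinate.

```python
import numpy as np

# Toy model: hidden activation h in R^3, scalar output y = w . h.
# Coordinate 0 carries the feature f(x) and is read by w (the faithful location).
# Coordinate 1 also encodes f(x) but is causally disconnected: w ignores it.
# Coordinate 2 is dormant: always 0 on natural inputs, yet w reads it.
w = np.array([1.0, 0.0, 1.0])

def hidden(f_x):
    """Hidden activation produced by an input whose feature value is f_x."""
    return np.array([f_x, f_x, 0.0])

def output(h):
    return float(w @ h)

def patch_subspace(h_base, h_source, v):
    """1-D subspace activation patching: overwrite the component of h_base
    along unit direction v with the corresponding component of h_source."""
    v = v / np.linalg.norm(v)
    return h_base + (v @ (h_source - h_base)) * v

h_base, h_source = hidden(1.0), hidden(-1.0)   # clean run vs. counterfactual run

v_faithful = np.array([1.0, 0.0, 0.0])         # the coordinate the model actually reads
v_illusory = np.array([0.0, 1.0, 1.0])         # disconnected + dormant mixture

for name, v in [("faithful", v_faithful), ("illusory", v_illusory)]:
    h_patched = patch_subspace(h_base, h_source, v)
    print(f"{name}: output {output(h_patched):+.2f}, feature coord {h_patched[0]:+.2f}")

# faithful: output -1.00, feature coord -1.00  (output flips because the feature changed)
# illusory: output +0.00, feature coord +1.00  (output moves toward the counterfactual
#           entirely through the dormant coordinate; the true feature is untouched)
```

In this toy, inspecting the feature coordinate (or, in a real model, the downstream pathway through which the patch acts) separates the two cases; this is the kind of additional faithfulness evidence the abstract refers to.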
Related papers
- GeneralAD: Anomaly Detection Across Domains by Attending to Distorted Features [68.14842693208465]
GeneralAD is an anomaly detection framework designed to operate in semantic, near-distribution, and industrial settings.
We propose a novel self-supervised anomaly generation module that applies simple operations, such as noise addition and shuffling, to the patch features.
We extensively evaluated our approach on ten datasets, achieving state-of-the-art results on six and on-par performance on the remaining four.
arXiv Detail & Related papers (2024-07-17T09:27:41Z)
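As a rough, hedged illustration of that kind of feature corruption (not GeneralAD's actual module; the function name, noise scale, and subset fraction below are assumptions), pseudo-anomalous patch features can be generated like this:

```python
import numpy as np

def generate_pseudo_anomalies(patch_feats, noise_std=0.25, frac=0.3, seed=0):
    """Corrupt normal patch features into pseudo-anomalous ones.

    patch_feats: [num_patches, dim] array of features from a frozen backbone
    (e.g. a ViT). A random subset of patches receives additive Gaussian noise
    and another random subset is shuffled among itself; a discriminator can
    then be trained to tell original features from corrupted ones.
    """
    rng = np.random.default_rng(seed)
    feats = patch_feats.copy()
    n = feats.shape[0]
    k = max(1, int(frac * n))

    # Noise addition on a random subset of patches.
    noise_idx = rng.choice(n, size=k, replace=False)
    feats[noise_idx] += noise_std * rng.standard_normal((k, feats.shape[1]))

    # Shuffling: permute another random subset of patches in place.
    shuf_idx = rng.choice(n, size=k, replace=False)
    feats[shuf_idx] = feats[rng.permutation(shuf_idx)]

    return feats
```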
- Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models [27.618704505738425]
Contrastive vision-language models (VLMs) have gained popularity for their versatile applicability to various downstream tasks.
Despite their successes in some tasks, like zero-shot object recognition, they perform surprisingly poorly on other tasks, like attribute recognition.
Previous work has attributed these challenges to the modality gap, a separation of image and text in the shared representation space, and to a bias towards objects over other factors, such as attributes.
arXiv Detail & Related papers (2024-04-11T17:58:06Z)
- Identifiable Latent Neural Causal Models [82.14087963690561]
Causal representation learning seeks to uncover latent, high-level causal representations from low-level observed data.
We determine the types of distribution shifts that do contribute to the identifiability of causal representations.
We translate our findings into a practical algorithm, allowing for the acquisition of reliable latent causal representations.
arXiv Detail & Related papers (2024-03-23T04:13:55Z)
- Emergent Causality and the Foundation of Consciousness [0.0]
We argue that in the absence of a $do$ operator, an intervention can be represented by a variable.
In a narrow sense this describes what it is to be aware, and is a mechanistic explanation of aspects of consciousness.
arXiv Detail & Related papers (2023-02-07T01:41:23Z)
- Nested Counterfactual Identification from Arbitrary Surrogate Experiments [95.48089725859298]
We study the identification of nested counterfactuals from an arbitrary combination of observations and experiments.
Specifically, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones.
arXiv Detail & Related papers (2021-07-07T12:51:04Z)
- Is Sparse Attention more Interpretable? [52.85910570651047]
We investigate how sparsity affects our ability to use attention as an explainability tool.
We find that only a weak relationship exists between inputs and co-indexed intermediate representations under sparse attention.
We observe in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.
arXiv Detail & Related papers (2021-06-02T11:42:56Z)
- Where and What? Examining Interpretable Disentangled Representations [96.32813624341833]
Capturing interpretable variations has long been one of the goals in disentanglement learning.
Unlike the independence assumption, interpretability has rarely been exploited to encourage disentanglement in the unsupervised setting.
In this paper, we examine the interpretability of disentangled representations by investigating two questions: where to be interpreted and what to be interpreted.
arXiv Detail & Related papers (2021-04-07T11:22:02Z)
- Disentangling Action Sequences: Discovering Correlated Samples [6.179793031975444]
We demonstrate that the data itself, rather than the factors, plays a crucial role in disentanglement, and that the disentangled representations align the latent variables with the action sequences.
We propose a novel framework, fractional variational autoencoder (FVAE) to disentangle the action sequences with different significance step-by-step.
Experimental results on dSprites and 3D Chairs show that FVAE improves the stability of disentanglement.
arXiv Detail & Related papers (2020-10-17T07:37:50Z)
- A Novel Perspective to Zero-shot Learning: Towards an Alignment of Manifold Structures via Semantic Feature Expansion [17.48923061278128]
A common practice in zero-shot learning is to train a projection between the visual and semantic feature spaces with labeled seen classes examples.
Under such a paradigm, most existing methods readily suffer from the domain shift problem, which weakens zero-shot recognition performance.
We propose a novel model called AMS-SFE that considers the alignment of manifold structures by semantic feature expansion.
arXiv Detail & Related papers (2020-04-30T14:08:10Z)
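To ground the "common practice" mentioned in this last summary (and not the proposed AMS-SFE model, whose manifold-alignment and semantic-feature-expansion steps are not reproduced here), a minimal hedged sketch of projection-based zero-shot recognition might look as follows; all names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_visual_to_semantic(X_seen, S_seen, reg=1.0):
    """Ridge-regression projection W: visual features -> semantic (attribute) space,
    fit on labeled seen-class examples.
    X_seen: [n, d] visual features; S_seen: [n, a] per-example class attribute vectors."""
    d = X_seen.shape[1]
    return np.linalg.solve(X_seen.T @ X_seen + reg * np.eye(d), X_seen.T @ S_seen)

def predict_unseen(X_test, W, unseen_attrs):
    """Assign each test sample to the unseen class whose attribute vector is most
    cosine-similar to the sample's projected visual feature."""
    P = X_test @ W
    P = P / (np.linalg.norm(P, axis=1, keepdims=True) + 1e-8)
    A = unseen_attrs / (np.linalg.norm(unseen_attrs, axis=1, keepdims=True) + 1e-8)
    return (P @ A.T).argmax(axis=1)

# Tiny synthetic check: seen classes use basis attributes, unseen classes are combinations.
rng = np.random.default_rng(0)
seen_attrs = np.eye(3)                                  # seen classes 0, 1, 2
unseen_attrs = np.array([[1., 1., 0.], [0., 1., 1.]])   # unseen classes 3, 4
M = rng.standard_normal((3, 8))                         # hidden attribute -> visual map
make = lambda a, n: np.tile(a @ M, (n, 1)) + 0.05 * rng.standard_normal((n, 8))

X_seen = np.vstack([make(a, 40) for a in seen_attrs])
S_seen = np.vstack([np.tile(a, (40, 1)) for a in seen_attrs])
W = fit_visual_to_semantic(X_seen, S_seen)

X_test = np.vstack([make(unseen_attrs[0], 5), make(unseen_attrs[1], 5)])
print(predict_unseen(X_test, W, unseen_attrs))          # expected: five 0s then five 1s
```

The domain shift problem arises because W is fit only on seen-class data, so projections of unseen-class features can drift away from their true attribute vectors; the manifold alignment in AMS-SFE targets exactly this gap.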
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.