The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
- URL: http://arxiv.org/abs/2507.08802v1
- Date: Fri, 11 Jul 2025 17:59:55 GMT
- Title: The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?
- Authors: Denis Sutter, Julian Minder, Thomas Hofmann, Tiago Pimentel
- Abstract summary: We critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. We show that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no way to balance the inherent trade-off between these maps' complexity and accuracy.
- Score: 36.38298679687864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The concept of causal abstraction was recently popularised to demystify the opaque decision-making processes of machine learning models; in short, a neural network can be abstracted as a higher-level algorithm if there exists a function that allows us to map between them. Notably, most interpretability papers implement these maps as linear functions, motivated by the linear representation hypothesis: the idea that features are encoded linearly in a model's representations. However, this linearity constraint is not required by the definition of causal abstraction. In this work, we critically examine the concept of causal abstraction by considering arbitrarily powerful alignment maps. In particular, we prove that under reasonable assumptions, any neural network can be mapped to any algorithm, rendering this unrestricted notion of causal abstraction trivial and uninformative. We complement these theoretical findings with empirical evidence, demonstrating that it is possible to perfectly map models to algorithms even when these models are incapable of solving the actual task; e.g., in an experiment using randomly initialised language models, our alignment maps reach 100% interchange-intervention accuracy on the indirect object identification task. This raises the non-linear representation dilemma: if we lift the linearity constraint imposed on alignment maps in causal abstraction analyses, we are left with no principled way to balance the inherent trade-off between these maps' complexity and accuracy. Together, these results suggest an answer to our title's question: causal abstraction is not enough for mechanistic interpretability, as it becomes vacuous without assumptions about how models encode information. Studying the connection between this information-encoding assumption and causal abstraction should lead to exciting future work.
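To make the abstract's central quantity concrete, the following is a minimal sketch of an interchange intervention and the resulting interchange-intervention accuracy (IIA). The toy task, the hand-built two-unit "network", and the identity rotation standing in for the alignment map are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal IIA sketch: a high-level algorithm, a toy low-level model, and an
# alignment map (here a linear rotation R) that claims hidden coordinate 0
# realises the high-level variable S. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def high_level(a, b):
    """High-level algorithm: S = a + b, output 1 iff S > 0."""
    return int(a + b > 0)

def model_hidden(x):
    """Toy 'network' hidden layer: h = [a + b, a - b]."""
    a, b = x
    return np.array([a + b, a - b])

def model_output(h):
    return int(h[0] > 0)

R = np.eye(2)          # alignment map: an (assumed) rotation of hidden space
aligned_dims = [0]     # hidden coordinates claimed to encode S

def interchange(base_x, source_x):
    """Patch the aligned subspace of the base run with the source run's value."""
    h_base = R @ model_hidden(base_x)
    h_source = R @ model_hidden(source_x)
    h_base[aligned_dims] = h_source[aligned_dims]
    return model_output(np.linalg.inv(R) @ h_base)

# IIA: fraction of (base, source) pairs on which the intervened model agrees
# with the correspondingly intervened high-level algorithm.
pairs = rng.uniform(-1.0, 1.0, size=(500, 2, 2))
hits = sum(
    interchange(base_x, source_x) == high_level(*source_x)
    for base_x, source_x in pairs
)
print(f"interchange-intervention accuracy: {hits / len(pairs):.2f}")
```

Because this toy model genuinely computes a + b in one hidden coordinate, the script reports an IIA of 1.00. The paper's point is that if R is replaced by an arbitrarily powerful non-linear map, the same perfect score can be reached even by models that cannot solve the task, which is what makes the unrestricted notion of causal abstraction vacuous.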
Related papers
- Abstraction requires breadth: a renormalisation group approach [0.0]
We argue that the level of abstraction depends crucially on how broad the training set is.
We take the unique fixed point of this renormalisation-group transformation -- the Hierarchical Feature Model -- as a candidate for an abstract representation.
arXiv Detail & Related papers (2024-07-01T14:13:11Z)
- Neural Causal Abstractions [63.21695740637627]
We develop a new family of causal abstractions by clustering variables and their domains.
We show that such abstractions are learnable in practical settings through Neural Causal Models.
Our experiments support the theory and illustrate how to scale causal inferences to high-dimensional settings involving image data.
arXiv Detail & Related papers (2024-01-05T02:00:27Z)
- Meaning Representations from Trajectories in Autoregressive Models [106.63181745054571]
We propose to extract meaning representations from autoregressive language models by considering the distribution of all possible trajectories extending an input text.
This strategy is prompt-free, does not require fine-tuning, and is applicable to any pre-trained autoregressive model.
We empirically show that the representations obtained from large models align well with human annotations, outperform other zero-shot and prompt-free methods on semantic similarity tasks, and can be used to solve more complex entailment and containment tasks that standard embeddings cannot handle.
arXiv Detail & Related papers (2023-10-23T04:35:58Z)
- On the Trade-off Between Efficiency and Precision of Neural Abstraction [62.046646433536104]
Neural abstractions have been recently introduced as formal approximations of complex, nonlinear dynamical models.
We employ formal inductive synthesis procedures to generate neural abstractions that result in dynamical models with these semantics.
arXiv Detail & Related papers (2023-07-28T13:22:32Z)
- Towards Computing an Optimal Abstraction for Structural Causal Models [16.17846886492361]
We focus on the problem of learning abstractions.
We suggest a concrete measure of information loss, and we illustrate its contribution to learning new abstractions.
arXiv Detail & Related papers (2022-08-01T14:35:57Z)
- Linear Adversarial Concept Erasure [108.37226654006153]
We formulate the problem of identifying and erasing a linear subspace that corresponds to a given concept.
We show that the method is highly expressive, effectively mitigating bias in deep nonlinear classifiers while maintaining tractability and interpretability. (A minimal sketch of linear-subspace erasure appears after this list.)
arXiv Detail & Related papers (2022-01-28T13:00:17Z)
- Towards a Mathematical Theory of Abstraction [0.0]
We provide a precise characterisation of what an abstraction is and, perhaps more importantly, suggest how abstractions can be learnt directly from data.
Our results have deep implications for statistical inference and machine learning and could be used to develop explicit methods for learning precise kinds of abstractions directly from data.
arXiv Detail & Related papers (2021-06-03T13:23:49Z)
- Structural Causal Models Are (Solvable by) Credal Networks [70.45873402967297]
Causal inferences can be obtained by standard algorithms for updating credal nets.
This contribution should be regarded as a systematic approach to representing structural causal models by credal networks.
Experiments show that approximate algorithms for credal networks can immediately be used to do causal inference in real-size problems.
arXiv Detail & Related papers (2020-08-02T11:19:36Z)
- Random thoughts about Complexity, Data and Models [0.0]
Data science and machine learning have been growing rapidly for the past decade.
We investigate the subtle relation between "data and models".
A key issue in appraising the relation between algorithmic complexity and algorithmic learning concerns the concepts of compressibility, determinism, and predictability.
arXiv Detail & Related papers (2020-04-16T14:27:22Z)
- Extracting Semantic Indoor Maps from Occupancy Grids [2.4214518935746185]
We focus on the semantic mapping of indoor environments.
We propose a method to extract an abstracted floor plan from typical grid maps using Bayesian reasoning.
We demonstrate the effectiveness of the approach using real-world data.
arXiv Detail & Related papers (2020-02-19T18:52:27Z)
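As noted in the Linear Adversarial Concept Erasure entry above, erasing a concept under the linear representation hypothesis amounts to removing a subspace. Below is a minimal sketch of that idea using a single probe direction and an orthogonal projection; the synthetic data and the plain probe-and-project step are simplifying assumptions, not the adversarial formulation of the cited paper.

```python
# Minimal sketch of linear concept erasure via orthogonal projection.
# Synthetic data and a single logistic-regression probe are assumptions
# for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy representations: a binary concept encoded along one direction plus noise.
n, d = 1000, 16
concept = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d)) + np.outer(concept - 0.5, 4.0 * direction)

# Estimate the concept direction with a linear probe.
probe = LogisticRegression(max_iter=1000).fit(X, concept)
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Project representations onto the orthogonal complement of the probe direction.
P = np.eye(d) - np.outer(w, w)
X_erased = X @ P

# The concept should now be much harder to recover with a linear probe.
before = LogisticRegression(max_iter=1000).fit(X, concept).score(X, concept)
after = LogisticRegression(max_iter=1000).fit(X_erased, concept).score(X_erased, concept)
print(f"linear probe accuracy before erasure: {before:.2f}, after: {after:.2f}")
```

A single projection only removes what one linear probe finds; practical methods iterate this step or pose it as a game, as the cited paper's "adversarial" framing suggests. The connection to the main paper is direct: once maps are allowed to be arbitrarily non-linear, there is no longer a single subspace whose removal guarantees the concept is gone, which is exactly the non-linear representation dilemma.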