Addressing divergent representations from causal interventions on neural networks
- URL: http://arxiv.org/abs/2511.04638v2
- Date: Sun, 09 Nov 2025 20:35:15 GMT
- Title: Addressing divergent representations from causal interventions on neural networks
- Authors: Satchel Grant, Simon Jerome Han, Alexa R. Tartaglini, Christopher Potts,
- Abstract summary: We show that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. In an effort to mitigate the pernicious cases, we modify the Counterfactual Latent loss from Grant (2025), which regularizes interventions to remain closer to the natural distributions.
- Score: 23.673327190142825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two classes of such divergences: "harmless" divergences that occur in the null-space of the weights and from covariance within behavioral decision boundaries, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we modify the Counterfactual Latent (CL) loss from Grant (2025), which regularizes interventions to remain closer to the natural distributions, reducing the likelihood of harmful divergences while preserving the interpretive power of interventions. Together, these results highlight a path towards more reliable interpretability methods.
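The two ideas at the core of the abstract, the split between "harmless" divergences confined to the null-space of the downstream weights and "pernicious" divergences that reach downstream pathways, and a regularizer that keeps intervened states near the model's natural distribution, can be made concrete. The sketch below is an assumption-laden illustration in PyTorch, not the paper's implementation: the function names, the SVD-based decomposition, and the diagonal Mahalanobis penalty standing in for the Counterfactual Latent (CL) loss are all hypothetical.

```python
# Hypothetical sketch of the abstract's two notions; not the authors' code.
import torch


def nullspace_fraction(W: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Fraction of the intervention `delta` that the next layer W cannot see.

    A divergence confined to the null-space of W leaves W @ h unchanged
    ("harmless" in the paper's terminology); the remaining row-space
    component can activate downstream pathways ("pernicious").
    """
    # Orthonormal basis of the row space of W via SVD.
    _, S, Vh = torch.linalg.svd(W, full_matrices=False)
    rank = int((S > S.max() * 1e-6).sum())
    V_row = Vh[:rank]                      # (rank, d) basis of the row space
    row_part = V_row.T @ (V_row @ delta)   # projection onto the row space
    null_part = delta - row_part
    return null_part.norm() ** 2 / delta.norm().clamp_min(1e-12) ** 2


def cl_style_penalty(h_intervened: torch.Tensor, h_natural: torch.Tensor) -> torch.Tensor:
    """Illustrative regularizer keeping intervened states near natural ones.

    Penalizes the diagonal Mahalanobis distance of each intervened hidden
    state from the batch statistics of natural hidden states; a stand-in
    for the CL loss described in the abstract, not its actual form.
    """
    mu = h_natural.mean(dim=0)
    var = h_natural.var(dim=0, unbiased=False) + 1e-6
    return (((h_intervened - mu) ** 2) / var).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    W = torch.randn(8, 32)        # next-layer weights (rank at most 8)
    h = torch.randn(16, 32)       # natural hidden states from the model
    delta = torch.randn(32)       # a candidate intervention direction
    print("null-space fraction:", nullspace_fraction(W, delta).item())
    print("CL-style penalty:", cl_style_penalty(h + delta, h).item())
```

Any intervention component lying in the null-space of the next layer's weights leaves that layer's pre-activations unchanged, which is why such divergences can be classified as harmless; the CL-style penalty simply discourages the complementary component from drifting far from the model's natural hidden-state statistics.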
Related papers
- On Evolution-Based Models for Experimentation Under Interference [7.262048441360133]
We study an evolution-based approach that investigates how outcomes change across observation rounds in response to interventions. We highlight causal message passing as an instantiation of this method in dense networks. We discuss the limits of this approach, showing that strong temporal trends or endogenous interference can undermine identification.
arXiv Detail & Related papers (2025-11-26T18:53:46Z) - Incorporating Interventional Independence Improves Robustness against Interventional Distribution Shift [14.497130575562698]
Existing approaches treat interventional data like observational data, even when the underlying causal model is known. We propose RepLIn, a training algorithm to explicitly enforce this statistical independence during interventions.
arXiv Detail & Related papers (2025-07-07T18:51:20Z) - Counterfactual Realizability [52.85109506684737]
We introduce a formal definition of realizability, the ability to draw samples from a distribution, and then develop a complete algorithm to determine whether an arbitrary counterfactual distribution is realizable. We illustrate the implications of this new framework for counterfactual data collection using motivating examples from causal fairness and causal reinforcement learning.
arXiv Detail & Related papers (2025-03-14T20:54:27Z) - What is causal about causal models and representations? [5.128695263114213]
Causal Bayesian networks are 'causal' models since they make predictions about interventional distributions. To connect such causal model predictions to real-world outcomes, we must determine which actions in the world correspond to which interventions in the model. We introduce a formal framework to make such requirements for different interpretations of actions as interventions precise.
arXiv Detail & Related papers (2025-01-31T17:35:21Z) - Towards Understanding Extrapolation: a Causal Lens [53.15488984371969]
We provide a theoretical understanding of when extrapolation is possible and offer principled methods to achieve it. Under this formulation, we cast the extrapolation problem into a latent-variable identification problem. Our theory reveals the intricate interplay between the underlying manifold's smoothness and the shift properties.
arXiv Detail & Related papers (2025-01-15T21:29:29Z) - Robust Domain Generalisation with Causal Invariant Bayesian Neural Networks [9.999199798941424]
We propose a Bayesian neural architecture that disentangles the learning of the data distribution from the inference process mechanisms.
We show theoretically and experimentally that our model approximates reasoning under causal interventions.
arXiv Detail & Related papers (2024-10-08T20:38:05Z) - Identifiable Latent Neural Causal Models [82.14087963690561]
Causal representation learning seeks to uncover latent, high-level causal representations from low-level observed data.
We determine the types of distribution shifts that do contribute to the identifiability of causal representations.
We translate our findings into a practical algorithm, allowing for the acquisition of reliable latent causal representations.
arXiv Detail & Related papers (2024-03-23T04:13:55Z) - Proxy Methods for Domain Adaptation [78.03254010884783]
Proxy variables allow for adaptation to distribution shift without explicitly recovering or modeling latent variables.
We develop a two-stage kernel estimation approach to adapt to complex distribution shifts in both settings.
arXiv Detail & Related papers (2024-03-12T09:32:41Z) - Differentiable Causal Discovery Under Latent Interventions [3.867363075280544]
Recent work has shown promising results in causal discovery by leveraging interventional data with gradient-based methods, even when the intervened variables are unknown.
We envision a scenario with an extensive dataset sampled from multiple intervention distributions and one observation distribution, but where we do not know which distribution originated each sample and how the intervention affected the system.
We propose a method based on neural networks and variational inference that addresses this scenario by framing it as learning a shared causal graph among an infinite mixture.
arXiv Detail & Related papers (2022-03-04T14:21:28Z) - Towards Robust and Adaptive Motion Forecasting: A Causal Representation Perspective [72.55093886515824]
We introduce a causal formalism of motion forecasting, which casts the problem as a dynamic process with three groups of latent variables.
We devise a modular architecture that factorizes the representations of invariant mechanisms and style confounders to approximate a causal graph.
Experiment results on synthetic and real datasets show that our three proposed components significantly improve the robustness and reusability of the learned motion representations.
arXiv Detail & Related papers (2021-11-29T18:59:09Z) - Which Invariance Should We Transfer? A Causal Minimax Learning Approach [18.71316951734806]
We present a comprehensive minimax analysis from a causal perspective.
We propose an efficient algorithm to search for the subset with minimal worst-case risk.
The effectiveness and efficiency of our methods are demonstrated on synthetic data and the diagnosis of Alzheimer's disease.
arXiv Detail & Related papers (2021-07-05T09:07:29Z) - Adversarial Robustness through the Lens of Causality [105.51753064807014]
The adversarial vulnerability of deep neural networks has attracted significant attention in machine learning.
We propose to incorporate causality into mitigating adversarial vulnerability.
Our method can be seen as the first attempt to leverage causality for mitigating adversarial vulnerability.
arXiv Detail & Related papers (2021-06-11T06:55:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.