Related papers: Transformer Is Inherently a Causal Learner

URL: http://arxiv.org/abs/2601.05647v1
Date: Fri, 09 Jan 2026 09:10:04 GMT
Title: Transformer Is Inherently a Causal Learner
Authors: Xinyue Wang, Stephen Wang, Biwei Huang
Abstract summary: We show that transformers trained in an autoregressive manner naturally encode time-delayed causal structures. We prove this connection theoretically under standard identifiability conditions. This approach greatly surpasses the performance of state-of-the-art discovery algorithms.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.
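
The extraction idea in the abstract (aggregate the gradient sensitivities of a one-step predictor with respect to its past inputs, then read off the lagged graph) can be illustrated with a small sketch. This is not the authors' implementation: a linear least-squares predictor stands in for the transformer, central finite differences stand in for automatic differentiation, and every name below (`simulate_var1`, `fit_predictor`, `aggregated_sensitivity`, the 0.2 threshold) is a hypothetical choice made for illustration only.

```python
import numpy as np

def simulate_var1(A, n_steps, noise_std=0.1, seed=0):
    """Simulate x[t] = A @ x[t-1] + noise (a lag-1 linear system)."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    x = np.zeros((n_steps, d))
    for t in range(1, n_steps):
        x[t] = A @ x[t - 1] + noise_std * rng.standard_normal(d)
    return x

def fit_predictor(x):
    """Stand-in for the trained model: least-squares one-step predictor."""
    past, future = x[:-1], x[1:]
    # Solves past @ B ~= future, so the prediction for a batch is inp @ B.
    B, *_ = np.linalg.lstsq(past, future, rcond=None)
    return lambda inp: inp @ B

def aggregated_sensitivity(predict, inputs, eps=1e-4):
    """Mean absolute d(output_i)/d(input_j) over all samples.

    Central finite differences are used here as an assumption-free
    substitute for autograd; scores[i, j] measures how strongly
    variable j at time t-1 influences variable i at time t.
    """
    n, d = inputs.shape
    scores = np.zeros((d, d))
    for j in range(d):
        bump = np.zeros(d)
        bump[j] = eps
        diff = (predict(inputs + bump) - predict(inputs - bump)) / (2 * eps)
        scores[:, j] = np.abs(diff).mean(axis=0)
    return scores

# Ground-truth lagged graph: entry (i, j) nonzero means j -> i with lag 1.
A_true = np.array([[0.5, 0.0, 0.0],
                   [0.4, 0.5, 0.0],
                   [0.0, 0.4, 0.5]])

x = simulate_var1(A_true, n_steps=5000)
predict = fit_predictor(x)
scores = aggregated_sensitivity(predict, x[:-1])

# Hypothetical fixed threshold; a real pipeline would need a principled rule.
estimated_graph = scores > 0.2
true_graph = A_true != 0
```

In this linear toy case the aggregated sensitivities reduce to the absolute regression coefficients, so thresholding them recovers the lagged adjacency matrix. The nonlinear, long-range, and non-stationary settings described in the abstract are where a transformer predictor would be needed in place of the stand-in.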

Related papers

Adjustment for Confounding using Pre-Trained Representations [2.916285040262091]
We investigate how latent features from pre-trained neural networks can be leveraged to adjust for sources of confounding. We show that neural networks can achieve fast convergence rates by adapting to intrinsic notions of sparsity and dimension of the learning problem.
arXiv Detail & Related papers (2025-06-17T09:11:17Z)
Solving Inverse Problems with FLAIR [68.87167940623318]
We present FLAIR, a training-free variational framework that leverages flow-based generative models as a prior for inverse problems. Results on standard imaging benchmarks demonstrate that FLAIR consistently outperforms existing diffusion- and flow-based methods in terms of reconstruction quality and sample diversity.
arXiv Detail & Related papers (2025-06-03T09:29:47Z)
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
arXiv Detail & Related papers (2025-03-17T02:00:49Z)
Differentiable Causal Discovery For Latent Hierarchical Causal Models [19.373348700715578]
We present new theoretical results on the identifiability of nonlinear latent hierarchical causal models. We develop a novel differentiable causal discovery algorithm that efficiently estimates the structure of such models.
arXiv Detail & Related papers (2024-11-29T09:08:20Z)
A Temporally Disentangled Contrastive Diffusion Model for Spatiotemporal Imputation [35.46631415365955]
We introduce a conditional diffusion framework called C$^2$TSD, which incorporates disentangled temporal (trend and seasonality) representations as conditional information. Our experiments on three real-world datasets demonstrate the superior performance of our approach compared to a number of state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-18T11:59:04Z)
On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics. The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
Identifiable Latent Polynomial Causal Models Through the Lens of Change [82.14087963690561]
Causal representation learning aims to unveil latent high-level causal representations from observed low-level data. One of its primary tasks is to provide reliable assurance of identifying these latent causal models, known as identifiability.
arXiv Detail & Related papers (2023-10-24T07:46:10Z)
Score-based Causal Representation Learning with Interventions [54.735484409244386]
This paper studies the causal representation learning problem when latent causal variables are observed indirectly. The objectives are: (i) recovering the unknown linear transformation (up to scaling) and (ii) determining the directed acyclic graph (DAG) underlying the latent variables.
arXiv Detail & Related papers (2023-01-19T18:39:48Z)
Principled Knowledge Extrapolation with GANs [92.62635018136476]
We study counterfactual synthesis from a new perspective of knowledge extrapolation. We show that an adversarial game with a closed-form discriminator can be used to address the knowledge extrapolation problem. Our method enjoys both elegant theoretical guarantees and superior performance in many scenarios.
arXiv Detail & Related papers (2022-05-21T08:39:42Z)
Disentangling Generative Factors of Physical Fields Using Variational Autoencoders [0.0]
This work explores the use of variational autoencoders (VAEs) for non-linear dimension reduction. A disentangled decomposition is interpretable and can be transferred to a variety of tasks including generative modeling.
arXiv Detail & Related papers (2021-09-15T16:02:43Z)
Identification of Latent Variables From Graphical Model Residuals [0.0]
We present a novel method to control for the latent space when estimating a DAG by iteratively deriving proxies for the latent space from the residuals of the inferred model. We show that any improvement of prediction of an outcome is intrinsically capped and cannot rise beyond a certain limit as compared to the confounded model.
arXiv Detail & Related papers (2021-01-07T02:28:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.