Related papers: When Do Transformers Learn Heuristics for Graph Connectivity?

When Do Transformers Learn Heuristics for Graph Connectivity?

URL: http://arxiv.org/abs/2510.19753v1
Date: Wed, 22 Oct 2025 16:43:32 GMT
Title: When Do Transformers Learn Heuristics for Graph Connectivity?
Authors: Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan,
Abstract summary: We prove that an $L$-layer model has capacity to solve for graphs with diameters up to exactly $3L$.<n>We analyze the training-dynamics, and show that the learned strategy hinges on whether most training instances are within this model capacity.
Score: 33.73385470817422
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an $L$-layer model has capacity to solve for graphs with diameters up to exactly $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training-dynamics, and show that the learned strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of a correct algorithmic solution while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data within a model's capacity leads to both standard and disentangled transformers learning the exact algorithm rather than the degree-based heuristic.

Related papers

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data [4.344634631420729]
We analyze the policy gradient dynamics of single-layer Transformers trained via Reinforcement Learning.<n>We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm.
arXiv Detail & Related papers (2026-01-21T16:36:19Z)
Graph Transformers Dream of Electric Flow [72.06286909236827]
We show that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems.<n>We present explicit weight configurations for implementing each algorithm, and we bound the constructed Transformers' errors by the errors of the underlying algorithms.<n>Our work is an initial step towards elucidating the inner-workings of the Transformer for graph data.
arXiv Detail & Related papers (2024-10-22T05:11:45Z)
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the gradient non-ity algorithms for our convergence landscape. This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption. We analyze how magnitude-based models affect generalization while improving adaption. We conclude that proper magnitude-based has a slight on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
Understanding Transformer Reasoning Capabilities via Graph Algorithms [25.08208816144745]
We study which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems. Our results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks.
arXiv Detail & Related papers (2024-05-28T18:31:14Z)
Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations. We show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
What learning algorithm is in-context learning? Investigations with linear models [87.91612418166464]
We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly. We show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression. Preliminary evidence that in-context learners share algorithmic features with these predictors.
arXiv Detail & Related papers (2022-11-28T18:59:51Z)
Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton. We show that a Transformer with $O(log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$. We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.