Transformers Struggle to Learn to Search
- URL: http://arxiv.org/abs/2412.04703v1
- Date: Fri, 06 Dec 2024 01:29:24 GMT
- Title: Transformers Struggle to Learn to Search
- Authors: Abulhair Saparov, Srushti Pawar, Shreyas Pimpalgaonkar, Nitish Joshi, Richard Yuanzhe Pang, Vishakh Padmakumar, Seyed Mehran Kazemi, Najoung Kim, He He
- Abstract summary: We use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers. We find that, when given the right training distribution, the transformer is able to learn to search. We also find that performing search in-context (i.e., chain-of-thought) does not resolve this inability to learn to search on larger graphs.
- Score: 32.231381064112085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Search is an ability foundational in many important tasks, and recent studies have shown that large language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use the foundational graph connectivity problem as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, when given the right training distribution, the transformer is able to learn to search. We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that for each vertex in the input graph, transformers compute the set of vertices reachable from that vertex. Each layer then progressively expands these sets, allowing the model to search over a number of vertices exponential in the number of layers. However, we find that as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that increasing model scale will not lead to robust search abilities. We also find that performing search in-context (i.e., chain-of-thought) does not resolve this inability to learn to search on larger graphs.
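The mechanism the abstract describes, per-vertex reachable sets that each layer expands by merging the sets of vertices already reached, is simple enough to sketch directly. Below is a minimal Python sketch (not the authors' code; the function names, graph-sampling parameters, and number of merge rounds are illustrative assumptions) of the two ingredients: generating graph-connectivity training queries, and the layer-wise reachable-set expansion attributed to the trained model.

```python
# Minimal sketch (not the paper's code): graph-connectivity example generation
# and the layer-wise reachable-set doubling described in the abstract.
import random

def transitive_reach(edges, n_vertices, n_rounds):
    """Layer-wise expansion of per-vertex reachable sets: each round merges the
    sets of already-reached vertices, so round L covers paths of up to 2**L edges."""
    reach = {v: {v} for v in range(n_vertices)}
    for u, v in edges:                       # start from direct successors
        reach[u].add(v)
    for _ in range(n_rounds):                # one merge round per "layer"
        reach = {v: set.union(*(reach[u] for u in reach[v]))
                 for v in range(n_vertices)}
    return reach

def sample_connectivity_example(n_vertices=16, n_edges=24, seed=0):
    """Sample a random directed graph plus a labelled (source, target) query."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.randrange(n_vertices), rng.randrange(n_vertices)
        if u != v:
            edges.add((u, v))
    source, target = rng.sample(range(n_vertices), 2)
    full_reach = transitive_reach(edges, n_vertices, n_rounds=n_vertices)
    return sorted(edges), source, target, target in full_reach[source]

if __name__ == "__main__":
    edges, s, t, label = sample_connectivity_example()
    print(f"edges: {edges}")
    print(f"query: is {t} reachable from {s}? {label}")
```

Because each round merges the reachable sets of vertices that are themselves already reachable, L rounds cover paths of length up to 2^L, which mirrors the abstract's claim that a model with L layers can search over a number of vertices exponential in L.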
Related papers
- A Theory for Compressibility of Graph Transformers for Transductive Learning [6.298115235439078]
Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks.
All train/test/validation samples are present during training, making them more akin to a semi-supervised task.
We establish some theoretical bounds on how and under what conditions the hidden dimension of these networks can be compressed.
arXiv Detail & Related papers (2024-11-20T04:20:17Z)
- One-Layer Transformer Provably Learns One-Nearest Neighbor In Context [48.4979348643494]
We study the capability of one-layer transformers learning the one-nearest neighbor rule.
A single softmax attention layer can successfully learn to behave like a one-nearest-neighbor predictor (a toy numerical illustration of this limit appears after this list).
arXiv Detail & Related papers (2024-11-16T16:12:42Z)
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning? [69.4145579827826]
We show a fast flow on the regression loss despite the non-convexity of the landscape for our algorithm.
This is the first theoretical analysis for multi-layer Transformer in this setting.
arXiv Detail & Related papers (2024-10-10T18:29:05Z)
- Understanding Transformer Reasoning Capabilities via Graph Algorithms [25.08208816144745]
We study which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems.
Our results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks.
arXiv Detail & Related papers (2024-05-28T18:31:14Z)
- How Transformers Learn Causal Structure with Gradient Descent [44.31729147722701]
Self-attention allows transformers to encode causal structure.
We introduce an in-context learning task that requires learning latent causal structure.
We show that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
arXiv Detail & Related papers (2024-02-22T17:47:03Z)
- Are More Layers Beneficial to Graph Transformers? [97.05661983225603]
Current graph transformers suffer from the bottleneck of improving performance by increasing depth.
Deep graph transformers are limited by the vanishing capacity of global attention.
We propose a novel graph transformer model named DeepGraph that explicitly employs substructure tokens in the encoded representation.
arXiv Detail & Related papers (2023-03-01T15:22:40Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass (a toy numerical check of this idea appears after this list).
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and achieves comparable or even better results than the vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate that our graph function works well.
arXiv Detail & Related papers (2022-05-27T14:36:55Z)
- Gophormer: Ego-Graph Transformer for Node Classification [27.491500255498845]
In this paper, we propose a novel Gophormer model which applies transformers on ego-graphs instead of full-graphs.
Specifically, a Node2Seq module is proposed to sample ego-graphs as the input of transformers, which alleviates the scalability challenge.
In order to handle the uncertainty introduced by the ego-graph sampling, we propose a consistency regularization and a multi-sample inference strategy.
arXiv Detail & Related papers (2021-10-25T16:43:32Z)
- AutoTrans: Automating Transformer Design via Reinforced Architecture Search [52.48985245743108]
This paper empirically explores how to set layer normalization, whether to scale, the number of layers, the number of heads, the activation function, etc., so that one can obtain a transformer architecture that better suits the task at hand.
Experiments on CoNLL03, Multi-30k, IWSLT14 and WMT-14 show that the searched transformer model can outperform standard transformers.
arXiv Detail & Related papers (2020-09-04T08:46:22Z)
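The one-layer transformer entry above claims that a single softmax attention layer can behave like a one-nearest-neighbor predictor. Below is a toy numerical illustration of that limit, not the paper's construction (the dimensions, inverse temperature, and unit-norm setup are illustrative assumptions): with a sharp softmax over query-key dot products of unit-norm vectors, the attention output collapses onto the label of the closest in-context example.

```python
# Toy illustration (not the paper's construction): sharp softmax attention over
# unit-norm keys approximates a one-nearest-neighbor prediction.
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 32
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm in-context inputs (keys)
y = rng.normal(size=n)                          # their labels (values)
q = rng.normal(size=d)
q /= np.linalg.norm(q)                          # unit-norm query

beta = 100.0                                    # large inverse temperature -> sharp softmax
scores = beta * (X @ q)                         # query-key dot products
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attn_prediction = weights @ y                   # attention output for the query

nearest = np.argmin(np.linalg.norm(X - q, axis=1))
print(attn_prediction, y[nearest])              # nearly identical for large beta
```

For unit-norm vectors the largest dot product coincides with the smallest Euclidean distance, so as the inverse temperature grows, the softmax weights concentrate on the nearest neighbor and the attention output approaches its label.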
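Similarly, the "Transformers learn in-context by gradient descent" entry describes trained Transformers as mesa-optimizers that take gradient-descent steps in their forward pass. The following is a minimal numerical check of the underlying idea, a sketch assuming linear attention and a zero initial weight vector rather than that paper's exact construction: a linear-attention readout over in-context (x_i, y_i) pairs matches the prediction of one gradient-descent step on in-context least squares.

```python
# Sketch (assumption: linear attention, zero weight initialization): one
# gradient-descent step on in-context least squares equals a linear-attention readout.
import numpy as np

rng = np.random.default_rng(1)
d, n, lr = 4, 16, 0.1
X = rng.normal(size=(n, d))                     # in-context inputs
y = rng.normal(size=n)                          # in-context targets
x_q = rng.normal(size=d)                        # query input

# One GD step on L(w) = (1/2n) * sum_i (w @ x_i - y_i)**2 starting from w = 0.
w_step = lr * (X.T @ y) / n
gd_prediction = w_step @ x_q

# Linear attention with keys = X, values = y, query = x_q (no softmax),
# scaled by lr / n, gives the same query prediction.
attn_prediction = (lr / n) * (X @ x_q) @ y
print(np.allclose(gd_prediction, attn_prediction))   # True
```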
This list is automatically generated from the titles and abstracts of the papers on this site.