Unveiling Transformers with LEGO: a synthetic reasoning task
- URL: http://arxiv.org/abs/2206.04301v1
- Date: Thu, 9 Jun 2022 06:30:17 GMT
- Title: Unveiling Transformers with LEGO: a synthetic reasoning task
- Authors: Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner
- Abstract summary: We study how the transformer architecture learns to follow a chain of reasoning.
In some data regimes, the trained transformer finds "shortcut" solutions for following the chain of reasoning.
We find that such shortcuts can be prevented with appropriate architectural modifications or careful data preparation.
- Score: 23.535488809197787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a synthetic task, LEGO (Learning Equality and Group Operations),
that encapsulates the problem of following a chain of reasoning, and we study
how the transformer architecture learns this task. We pay special attention to
data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset
composition (e.g., differing chain length at training and test time), as well
as architectural variants such as weight-tied layers or adding convolutional
components. We study how the trained models eventually succeed at the task, and
in particular, we are able to understand (to some extent) some of the attention
heads as well as how the information flows in the network. Based on these
observations we propose a hypothesis that here pretraining helps merely due to
being a smart initialization rather than some deep knowledge stored in the
network. We also observe that in some data regimes the trained transformer finds
"shortcut" solutions to follow the chain of reasoning, which impedes the
model's ability to generalize to simple variants of the main task, and moreover
we find that one can prevent such shortcuts with appropriate architecture
modifications or careful data preparation. Motivated by our findings, we begin
to explore the task of learning to execute C programs, where a convolutional
modification to transformers, namely adding convolutional structures in the
key/query/value maps, shows an encouraging edge.
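The abstract does not spell out the task format, so the following is only a rough sketch of what a LEGO-style instance might look like. It assumes the simplest setting one could pick: variables take values in {+1, -1}, each clause either assigns a constant or copies another variable with an optional sign flip, and the clauses are shuffled so that the chain has to be followed rather than read left to right. The function names `make_lego_chain` and `solve`, the clause syntax, and the parameters are all hypothetical and chosen for illustration; they are not taken from the authors' code.

```python
import random


def make_lego_chain(n_vars=6, seed=0):
    """Build one shuffled chain of sign-flip clauses plus its ground-truth values.

    Assumption: values live in {+1, -1} and every clause has the form
    "x = +1", "x = -1", "x = +y" or "x = -y". Supports up to 26 variables.
    """
    rng = random.Random(seed)
    names = [chr(ord("a") + i) for i in range(n_vars)]
    rng.shuffle(names)  # variable names carry no information about chain order

    # Root clause: the first variable is assigned a constant.
    sign = rng.choice([1, -1])
    values = {names[0]: sign}
    clauses = [f"{names[0]} = {'+' if sign == 1 else '-'}1"]

    # Every later variable depends on the previous one in the chain.
    for prev, cur in zip(names, names[1:]):
        sign = rng.choice([1, -1])
        values[cur] = sign * values[prev]
        clauses.append(f"{cur} = {'+' if sign == 1 else '-'}{prev}")

    rng.shuffle(clauses)  # scramble clause order within the sentence
    return "; ".join(clauses) + ";", values


def solve(sentence):
    """Reference solver: keep resolving clauses whose right-hand side is known."""
    pending = [c.strip() for c in sentence.split(";") if c.strip()]
    known = {}
    while pending:
        remaining = []
        for clause in pending:
            lhs, rhs = [part.strip() for part in clause.split("=")]
            sign = 1 if rhs[0] == "+" else -1
            operand = rhs[1:]
            if operand == "1":
                known[lhs] = sign
            elif operand in known:
                known[lhs] = sign * known[operand]
            else:
                remaining.append(clause)  # dependency not resolved yet
        if len(remaining) == len(pending):
            raise ValueError("sentence contains an unresolvable clause")
        pending = remaining
    return known


if __name__ == "__main__":
    sentence, truth = make_lego_chain(n_vars=6, seed=0)
    print(sentence)
    print(solve(sentence) == truth)
```

Under these assumptions, varying `n_vars` between training and test data would be one way to mimic the "differing chain length at training and test time" setting the abstract mentions. The iterative solver resolves one link of the chain at a time, which is the behavior a shortcut solution would bypass.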
Related papers
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z)
- Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing [10.206921909332006]
We investigate the mechanisms of how transformers behave on compositional problems.
We discover that the parameter initialization scale plays a critical role in determining whether the model learns inferential solutions.
We find that inferential solutions exhibit low complexity bias, which we hypothesize is a key factor enabling them to learn individual mappings for single anchors.
arXiv Detail & Related papers (2024-05-08T20:23:24Z)
- How Transformers Learn Causal Structure with Gradient Descent [49.808194368781095]
Self-attention allows transformers to encode causal structure.
We introduce an in-context learning task that requires learning latent causal structure.
We show that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
arXiv Detail & Related papers (2024-02-22T17:47:03Z)
- When can transformers reason with abstract symbols? [25.63285482210457]
We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set.
This is in contrast to classical fully-connected networks, which we prove fail to learn to reason.
arXiv Detail & Related papers (2023-10-15T06:45:38Z)
- Can Transformers Learn to Solve Problems Recursively? [9.5623664764386]
This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification.
By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions.
arXiv Detail & Related papers (2023-05-24T04:08:37Z)
- Build generally reusable agent-environment interaction models [28.577502598559988]
This paper tackles the problem of how to pre-train a model so that it serves as a generally reusable backbone for downstream task learning.
We propose a method that builds an agent-environment interaction model by learning domain-invariant successor features from the agent's vast experiences covering various tasks, then discretizing them into behavior prototypes.
We provide preliminary results that show downstream task learning based on a pre-trained embodied set structure can handle unseen changes in task objectives, environmental dynamics and sensor modalities.
arXiv Detail & Related papers (2022-11-13T07:33:14Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experimental results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
- CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning [138.40338621974954]
CausalWorld is a benchmark for causal structure and transfer learning in a robotic manipulation environment.
Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures.
arXiv Detail & Related papers (2020-10-08T23:01:13Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.