Unveiling Transformers with LEGO: a synthetic reasoning task
- URL: http://arxiv.org/abs/2206.04301v1
- Date: Thu, 9 Jun 2022 06:30:17 GMT
- Title: Unveiling Transformers with LEGO: a synthetic reasoning task
- Authors: Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Tal Wagner
- Abstract summary: We study how the transformer architecture learns to follow a chain of reasoning.
In some data regimes, the trained transformer finds "shortcut" solutions to follow the chain of reasoning.
We find that such shortcuts can be prevented with appropriate architecture modifications or careful data preparation.
- Score: 23.535488809197787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a synthetic task, LEGO (Learning Equality and Group Operations),
that encapsulates the problem of following a chain of reasoning, and we study
how the transformer architecture learns this task. We pay special attention to
data effects such as pretraining (on seemingly unrelated NLP tasks) and dataset
composition (e.g., differing chain length at training and test time), as well
as architectural variants such as weight-tied layers or adding convolutional
components. We study how the trained models eventually succeed at the task, and
in particular, we are able to understand (to some extent) some of the attention
heads as well as how the information flows in the network. Based on these
observations, we propose the hypothesis that pretraining helps here merely by
providing a smart initialization rather than through deep knowledge stored in the
network. We also observe that in some data regimes the trained transformer finds
"shortcut" solutions to follow the chain of reasoning, which impede the
model's ability to generalize to simple variants of the main task; moreover,
we find that such shortcuts can be prevented with appropriate architecture
modifications or careful data preparation. Motivated by our findings, we begin
to explore the task of learning to execute C programs, where a convolutional
modification to transformers, namely adding convolutional structures in the
key/query/value maps, shows an encouraging edge.
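The LEGO task itself is compact enough to sketch. The snippet below is a minimal, illustrative data generator consistent with the abstract's description: each clause assigns a variable either a group element or the product of a group element with a previously defined variable, here using the group {+1, -1} under multiplication as an assumed instantiation. The clause format, clause shuffling, and chain length are choices of this sketch, not specifics confirmed by the paper.

```python
import random
import string

def make_lego_example(chain_length=6, rng=random):
    """Generate one LEGO-style chain-of-reasoning example.

    A sketch using the group {+1, -1} under multiplication: the first
    clause fixes a root variable, and every later clause defines a new
    variable as +/- a previously defined one, so resolving the last
    variable requires following the whole chain.
    """
    names = rng.sample(string.ascii_lowercase, chain_length)
    signs = [rng.choice([+1, -1]) for _ in range(chain_length)]

    values = {names[0]: signs[0]}
    clauses = [f"{names[0]} = {'+' if signs[0] > 0 else '-'}1"]
    for i in range(1, chain_length):
        prev = names[i - 1]
        values[names[i]] = signs[i] * values[prev]
        clauses.append(f"{names[i]} = {'+' if signs[i] > 0 else '-'}{prev}")

    rng.shuffle(clauses)               # presentation order need not follow the chain
    return "; ".join(clauses), values  # input sentence, per-variable labels

if __name__ == "__main__":
    sentence, labels = make_lego_example()
    print(sentence)   # e.g. "d = -c; a = +1; b = -a; ..."
    print(labels)     # e.g. {'a': 1, 'b': -1, ...}
```

The abstract also mentions adding convolutional structures in the key/query/value maps. One plausible reading, sketched below in PyTorch, is to run a depthwise 1D convolution over the sequence before attention, so that each query/key/value vector mixes information from neighboring positions; the kernel size and exact placement of the convolutions are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class ConvQKVAttention(nn.Module):
    """Self-attention whose Q/K/V inputs first pass through depthwise 1D
    convolutions over the sequence dimension (one guess at what
    "convolutional structures in the key/query/value maps" could mean)."""

    def __init__(self, d_model: int, n_heads: int, kernel_size: int = 3):
        super().__init__()
        # One depthwise convolution per stream (query, key, value).
        self.convs = nn.ModuleList([
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model)
            for _ in range(3)
        ])
        # nn.MultiheadAttention still applies its own learned projections.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); Conv1d expects (batch, channels, seq_len).
        q, k, v = (conv(x.transpose(1, 2)).transpose(1, 2) for conv in self.convs)
        out, _ = self.attn(q, k, v, need_weights=False)
        return out
```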
Related papers
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers [16.26331213222281]
We investigate how architectural design choices influence the space of solutions that a transformer can implement and learn.
We characterize two different counting strategies that small transformers can implement theoretically.
Our findings highlight that even in simple settings, slight variations in model design can cause significant changes to the solutions a transformer learns.
arXiv Detail & Related papers (2024-07-16T09:48:10Z)
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z)
- Initialization is Critical to Whether Transformers Fit Composite Functions by Inference or Memorizing [10.206921909332006]
Transformers have shown impressive capabilities across various tasks, but their performance on compositional problems remains a topic of debate.
In this work, we investigate the mechanisms of how transformers behave on unseen compositional tasks.
arXiv Detail & Related papers (2024-05-08T20:23:24Z)
- When can transformers reason with abstract symbols? [25.63285482210457]
We prove that for any relational reasoning task in a large family of tasks, transformers learn the abstract relations and generalize to the test set.
This is in contrast to classical fully-connected networks, which we prove fail to learn to reason.
arXiv Detail & Related papers (2023-10-15T06:45:38Z)
- Can Transformers Learn to Solve Problems Recursively? [9.5623664764386]
This paper examines how neural networks learn algorithms relevant to programs and formal verification.
By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions.
arXiv Detail & Related papers (2023-05-24T04:08:37Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
- CausalWorld: A Robotic Manipulation Benchmark for Causal Structure and Transfer Learning [138.40338621974954]
CausalWorld is a benchmark for causal structure and transfer learning in a robotic manipulation environment.
Tasks consist of constructing 3D shapes from a given set of blocks - inspired by how children learn to build complex structures.
arXiv Detail & Related papers (2020-10-08T23:01:13Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
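To make the last entry's mechanism concrete, here is a rough sketch of optimal-transport-based set aggregation: compute an entropy-regularized transport plan between an input set and a small trainable reference via Sinkhorn iterations, then pool the set according to that plan to obtain a fixed-size representation. The uniform marginals, regularization strength, and rescaling below are assumptions of this sketch, not details taken from that paper.

```python
import torch

def sinkhorn_plan(cost: torch.Tensor, eps: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """Entropy-regularized transport plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)                   # Gibbs kernel
    a = torch.full((n,), 1.0 / n)                # uniform source marginal
    b = torch.full((m,), 1.0 / m)                # uniform target marginal
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)   # plan of shape (n, m)

def ot_aggregate(features: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """Pool an (n, d) input set into a fixed-size (m, d) representation
    by transporting it onto an (m, d) trainable reference."""
    cost = torch.cdist(features, reference) ** 2
    plan = sinkhorn_plan(cost)
    # Each column of the plan sums to 1/m, so scaling by m yields weighted averages.
    return reference.shape[0] * plan.t() @ features
```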
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.