Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks
- URL: http://arxiv.org/abs/2210.00400v1
- Date: Sun, 2 Oct 2022 00:46:36 GMT
- Title: Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks
- Authors: Yuxuan Li and James L. McClelland
- Abstract summary: We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
- Score: 6.525090891505941
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer networks have seen great success in natural language processing
and machine vision, where task objectives such as next word prediction and
image classification benefit from nuanced context sensitivity across
high-dimensional inputs. However, there is an ongoing debate about how and when
transformers can acquire highly structured behavior and achieve systematic
generalization. Here, we explore how well a causal transformer can perform a
set of algorithmic tasks, including copying, sorting, and hierarchical
compositions of these operations. We demonstrate strong generalization to
sequences longer than those used in training by replacing the standard
positional encoding typically used in transformers with labels arbitrarily
paired with items in the sequence. By finding the layer and head configuration
sufficient to solve the task, then performing ablation experiments and
representation analysis, we show that two-layer transformers learn
generalizable solutions to multi-level problems and develop signs of systematic
task decomposition. They also exploit shared computation across related tasks.
These results provide key insights into how transformer models may be capable
of decomposing complex decisions into reusable, multi-level policies in tasks
requiring structured behavior.
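To make the label-based positional scheme concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract (an illustration, not the authors' released code): each sequence's position labels are sampled at random from a label vocabulary larger than any training sequence and sorted so that they still convey relative order but carry no fixed absolute meaning. The class name `RandomLabelPositionalEncoding` and the per-sequence sampling strategy are assumptions.

```python
import torch
import torch.nn as nn

class RandomLabelPositionalEncoding(nn.Module):
    """Sketch of a label-based alternative to standard positional encoding.

    Each position gets a label drawn from a vocabulary larger than any
    training sequence; labels are sorted so they still convey relative
    order, so a longer test sequence simply draws labels the model has
    already seen paired with other items during training.
    """

    def __init__(self, num_labels: int, d_model: int):
        super().__init__()
        # num_labels should exceed the longest sequence you expect at test time.
        self.num_labels = num_labels
        self.label_emb = nn.Embedding(num_labels, d_model)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model)
        batch, seq_len, _ = token_emb.shape
        # Sample seq_len distinct labels per sequence, then sort them so
        # only relative order is informative, not absolute position.
        labels = torch.stack([
            torch.sort(torch.randperm(self.num_labels)[:seq_len]).values
            for _ in range(batch)
        ]).to(token_emb.device)                      # (batch, seq_len)
        return token_emb + self.label_emb(labels)
```

With, for example, 512 labels and a 64-dimensional model, `RandomLabelPositionalEncoding(512, 64)(torch.randn(2, 10, 64))` draws a fresh ordered label set on every forward pass, so sequences longer than those seen in training are not out of distribution from the encoding's point of view.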
Related papers
- Enhancing Transformers for Generalizable First-Order Logical Entailment [51.04944136538266]
This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge.
The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment.
We propose a more sophisticated, logic-aware architecture, TEGA, to improve transformers' capability for generalizable first-order logical entailment.
arXiv Detail & Related papers (2025-01-01T07:05:32Z)
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to the ability of pretrained large language models to learn a new task from a few examples provided at inference time.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- Attention as a Hypernetwork [22.087242869138223]
Transformers can generalize to novel problem instances whose constituent parts might have been encountered during training, but whose compositions have not.
By reformulating multi-head attention as a hypernetwork, we reveal that a composable, low-dimensional latent code specifies key-query specific operations.
We find that this latent code is predictive of the subtasks the network performs on unseen task compositions, revealing that latent codes acquired during training are reused to solve unseen problem instances.
arXiv Detail & Related papers (2024-06-09T15:08:00Z)
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures.
We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
- AlgoFormer: An Efficient Transformer Framework with Algorithmic Structures [80.28359222380733]
We design a novel transformer framework, dubbed AlgoFormer, to empower transformers with algorithmic capabilities.
In particular, inspired by the structure of human-designed learning algorithms, our transformer framework includes a pre-transformer responsible for task preprocessing.
Some theoretical and empirical results are presented to show that the designed transformer has the potential to perform algorithm representation and learning.
arXiv Detail & Related papers (2024-02-21T07:07:54Z)
- What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer.
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z)
- Adaptivity and Modularity for Efficient Generalization Over Task Complexity [42.748898521364914]
We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential steps.
We propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hypernetworks with adaptive depth from Universal Transformers.
arXiv Detail & Related papers (2023-10-13T05:29:09Z)
- When Can Transformers Ground and Compose: Insights from Compositional Generalization Benchmarks [7.4726048754587415]
Humans can reason compositionally whilst grounding language utterances to the real world.
Recent benchmarks like ReaSCAN use navigation tasks grounded in a grid world to assess whether neural models exhibit similar capabilities.
We present a simple transformer-based model that outperforms specialized architectures on ReaSCAN and a modified version of gSCAN.
arXiv Detail & Related papers (2022-10-23T17:03:55Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem in which each task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages; a Python-style analogue of the histogram program is sketched below.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
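As a rough illustration of the RASP programs mentioned in the last entry, the following is a plain-Python analogue (an assumption for illustration, not actual RASP syntax) of the histogram example: `select` plays the role of an attention pattern built from a pairwise predicate, and `selector_width` counts how many positions each query attends to, which for a same-token predicate is exactly each token's count.

```python
from typing import Callable, List

def select(keys: List, queries: List, predicate: Callable) -> List[List[bool]]:
    """For each query position, mark which key positions it attends to."""
    return [[predicate(k, q) for k in keys] for q in queries]

def selector_width(selector: List[List[bool]]) -> List[int]:
    """Number of selected positions per query -- enough to build a histogram."""
    return [sum(row) for row in selector]

tokens = list("aabcaab")
# Histogram: each position attends to every position holding the same token,
# and the width of that selector is that token's count in the sequence.
same_token = select(tokens, tokens, lambda k, q: k == q)
hist = selector_width(same_token)
print(list(zip(tokens, hist)))  # [('a', 4), ('a', 4), ('b', 2), ('c', 1), ...]
```

The actual RASP primitives of the same names operate on sequence-level selectors rather than Python lists, but the mapping of one `select`/`selector_width` pair onto a single attention layer is the same idea.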