Compositional Capabilities of Autoregressive Transformers: A Study on
Synthetic, Interpretable Tasks
- URL: http://arxiv.org/abs/2311.12997v2
- Date: Mon, 5 Feb 2024 23:29:12 GMT
- Title: Compositional Capabilities of Autoregressive Transformers: A Study on
Synthetic, Interpretable Tasks
- Authors: Rahul Ramesh, Ekdeep Singh Lubana, Mikail Khona, Robert P. Dick,
Hidenori Tanaka
- Abstract summary: We train autoregressive Transformer models on a synthetic data-generating process.
We show that autoregressive Transformers can learn compositional structures from small amounts of training data.
- Score: 23.516986266146855
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers trained on huge text corpora exhibit a remarkable set of
capabilities, e.g., performing basic arithmetic. Given the inherent
compositional nature of language, one can expect the model to learn to compose
these capabilities, potentially yielding a combinatorial explosion of what
operations it can perform on an input. Motivated by the above, we train
autoregressive Transformer models on a synthetic data-generating process that
involves compositions of a set of well-defined monolithic capabilities. Through
a series of extensive and systematic experiments on this data-generating
process, we show that: (1) autoregressive Transformers can learn compositional
structures from small amounts of training data and generalize to exponentially
or even combinatorially many functions; (2) generating intermediate outputs
when composing functions is more effective for generalizing to new, unseen
compositions than not generating any intermediate outputs; (3) biases in the
order of the compositions in the training data result in Transformers that fail
to compose some combinations of functions; and (4) the attention layers select
which capability to apply while the feed-forward layers execute the selected
capability.
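For concreteness, the data-generating process can be pictured with the minimal sketch below. It assumes, purely for illustration (none of this is taken from the paper's code), that each monolithic capability is a random token-level bijection over a small vocabulary and that a training example lists the composed function identifiers, the input sequence, and either every intermediate output or only the final output; the paper's exact function families, prompt format, and hyperparameters may differ.

import random

VOCAB = list(range(10))        # hypothetical token vocabulary
NUM_CAPABILITIES = 5           # number of monolithic capabilities
SEQ_LEN = 6                    # input sequence length
COMPOSITION_DEPTH = 3          # how many capabilities are composed per example

def make_bijection(rng):
    # One "capability": a random bijection (permutation) over the vocabulary.
    shuffled = VOCAB[:]
    rng.shuffle(shuffled)
    table = dict(zip(VOCAB, shuffled))
    return lambda seq: [table[t] for t in seq]

def generate_example(capabilities, rng, show_intermediate=True):
    # Sample a random composition of capabilities and apply it to a random input.
    order = rng.sample(range(len(capabilities)), COMPOSITION_DEPTH)
    x = [rng.choice(VOCAB) for _ in range(SEQ_LEN)]
    tokens = ["f%d" % i for i in order] + ["|"] + [str(t) for t in x]
    for i in order:                     # apply the selected capabilities in order
        x = capabilities[i](x)
        if show_intermediate:           # step-by-step format: emit each result
            tokens += [">"] + [str(t) for t in x]
    if not show_intermediate:           # direct format: emit only the final result
        tokens += [">"] + [str(t) for t in x]
    return " ".join(tokens)

rng = random.Random(0)
caps = [make_bijection(rng) for _ in range(NUM_CAPABILITIES)]
print(generate_example(caps, rng, show_intermediate=True))   # with intermediate outputs
print(generate_example(caps, rng, show_intermediate=False))  # final output only

Contrasting the two output formats mirrors finding (2): the step-by-step version exposes the intermediate results of the composition, whereas the direct version asks the model to produce only the final output.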
Related papers
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models [9.340409961107955]
Transformer models have the remarkable ability to perform in-context learning (ICL).
We study how effectively transformers can bridge between the function classes seen in their pretraining data mixture.
Our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases.
arXiv Detail & Related papers (2023-11-01T21:41:08Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps toward understanding in-context learning (ICL) in more complex scenarios by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We theoretically show the existence of transformers of mild depth and size that approximately implement the corresponding in-context learning algorithms.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z)
- Set Interdependence Transformer: Set-to-Sequence Neural Networks for Permutation Learning and Structure Prediction [6.396288020763144]
Set-to-sequence problems occur in natural language processing, computer vision and structure prediction.
Previous attention-based methods require $n$ layers of their set transformations to explicitly represent $n$-th order relations.
We propose a novel neural set encoding method called the Set Interdependence Transformer, capable of relating the set's permutation invariant representation to its elements within sets of any cardinality.
arXiv Detail & Related papers (2022-06-08T07:46:49Z)
- Compositional Generalization and Decomposition in Neural Program Synthesis [59.356261137313275]
In this paper, we focus on measuring the ability of learned program synthesizers to compositionally generalize.
We first characterize several different axes along which program synthesis methods should be expected to generalize.
We introduce a benchmark suite of tasks to assess these abilities based on two popular existing datasets.
arXiv Detail & Related papers (2022-04-07T22:16:05Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.