Inducing Systematicity in Transformers by Attending to Structurally
Quantized Embeddings
- URL: http://arxiv.org/abs/2402.06492v1
- Date: Fri, 9 Feb 2024 15:53:15 GMT
- Title: Inducing Systematicity in Transformers by Attending to Structurally
Quantized Embeddings
- Authors: Yichen Jiang, Xiang Zhou, Mohit Bansal
- Abstract summary: Transformers generalize to novel compositions of structures and entities after being trained on a complex dataset.
We propose SQ-Transformer that explicitly encourages systematicity in the embeddings and attention layers.
We show that SQ-Transformer achieves stronger compositional generalization than the vanilla Transformer on multiple low-complexity semantic parsing and machine translation datasets.
- Score: 60.698130703909804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers generalize to novel compositions of structures and entities
after being trained on a complex dataset, but easily overfit on datasets of
insufficient complexity. We observe that when the training set is sufficiently
complex, the model encodes sentences that have a common syntactic structure
using a systematic attention pattern. Inspired by this observation, we propose
SQ-Transformer (Structurally Quantized) that explicitly encourages
systematicity in the embeddings and attention layers, even with a training set
of low complexity. At the embedding level, we introduce Structure-oriented
Vector Quantization (SoVQ) to cluster word embeddings into several classes of
structurally equivalent entities. At the attention level, we devise the
Systematic Attention Layer (SAL) and an alternative, Systematically Regularized
Layer (SRL) that operate on the quantized word embeddings so that sentences of
the same structure are encoded with invariant or similar attention patterns.
Empirically, we show that SQ-Transformer achieves stronger compositional
generalization than the vanilla Transformer on multiple low-complexity semantic
parsing and machine translation datasets. In our analysis, we show that SoVQ
indeed learns a syntactically clustered embedding space and SAL/SRL induces
generalizable attention patterns, which lead to improved systematicity.
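The abstract describes the two mechanisms but gives no implementation details here. Below is a minimal, hypothetical PyTorch sketch of the idea only: class names, dimensions, the nearest-codebook assignment, and the straight-through trick are our assumptions for illustration, not the authors' code, and the structure-oriented training objective of SoVQ is omitted entirely.

```python
# Minimal, hypothetical sketch (not the authors' released code) of the two ideas
# in the abstract: (1) quantize word embeddings into a small set of structural
# classes, and (2) compute attention scores from the class embeddings so that
# sentences sharing a structure share an attention pattern.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StructuralQuantizer(nn.Module):
    """Nearest-codebook quantizer; a stand-in for SoVQ.

    The real SoVQ additionally trains the codebook with a structure-oriented
    objective so that classes align with syntactic roles; that objective is
    omitted here.
    """

    def __init__(self, d_model: int, num_classes: int):
        super().__init__()
        self.codebook = nn.Embedding(num_classes, d_model)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, d_model)
        # Squared distance from every token embedding to every class vector.
        dists = (emb.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        codes = dists.argmin(dim=-1)          # (batch, seq_len) class indices
        quantized = self.codebook(codes)      # class embeddings
        # Straight-through estimator: gradients still reach the word embeddings.
        return emb + (quantized - emb).detach()


class ClassConditionedAttention(nn.Module):
    """Single-head attention whose scores depend only on the quantized classes.

    Illustrates the intent of SAL: two sentences whose words map to the same
    class sequence get identical attention weights, while the values still
    carry word-specific content.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, word_emb: torch.Tensor, class_emb: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(class_emb)            # queries from structural classes
        k = self.k_proj(class_emb)            # keys from structural classes
        v = self.v_proj(word_emb)             # values from the original words
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    emb = torch.randn(2, 7, 64)               # toy batch of 7-token sentences
    sovq = StructuralQuantizer(d_model=64, num_classes=8)
    sal = ClassConditionedAttention(d_model=64)
    out = sal(emb, sovq(emb))
    print(out.shape)                          # torch.Size([2, 7, 64])
```

The SRL variant mentioned in the abstract, which only encourages similar rather than identical attention patterns, would instead regularize a standard attention layer toward this class-conditioned one; that part is not shown in the sketch.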
Related papers
- Learning Syntax Without Planting Trees: Understanding When and Why Transformers Generalize Hierarchically [74.96551626420188]
Transformers trained on natural language data have been shown to learn its hierarchical structure and generalize to sentences with unseen syntactic structures.
We investigate sources of inductive bias in transformer models and their training that could cause such generalization behavior to emerge.
arXiv Detail & Related papers (2024-04-25T07:10:29Z)
- Graph-Induced Syntactic-Semantic Spaces in Transformer-Based Variational AutoEncoders [5.037881619912574]
In this paper, we investigate latent space separation methods for structural syntactic injection in Transformer-based VAEs.
Specifically, we explore how syntactic structures can be leveraged in the encoding stage through the integration of graph-based and sequential models.
Our empirical evaluation, carried out on natural language sentences and mathematical expressions, reveals that the proposed end-to-end VAE architecture can result in a better overall organisation of the latent space.
arXiv Detail & Related papers (2023-11-14T22:47:23Z)
- How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations [98.7450564309923]
This paper takes initial steps on understanding in-context learning (ICL) in more complex scenarios, by studying learning with representations.
We construct synthetic in-context learning problems with a compositional structure, where the label depends on the input through a possibly complex but fixed representation function.
We show theoretically the existence of transformers that approximately implement such in-context learning algorithms with mild depth and size.
arXiv Detail & Related papers (2023-10-16T17:40:49Z)
- DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states.
We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs.
Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
arXiv Detail & Related papers (2023-01-23T15:18:54Z)
- Forming Trees with Treeformers [3.8073142980733]
Many state-of-the-art neural network models such as Transformers have no explicit hierarchical structure in their architecture.
We introduce Treeformer, a general-purpose encoder module inspired by the CKY algorithm.
Our experiments demonstrate the benefits of incorporating hierarchical structure into the Transformer.
arXiv Detail & Related papers (2022-07-14T14:39:30Z)
- Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale [31.293175512404172]
We introduce Transformer Grammars -- a class of Transformer language models that combine the expressive power, scalability, and strong performance of Transformers with recursive syntactic compositions.
We find that Transformer Grammars outperform various strong baselines on multiple syntax-sensitive language modeling evaluation metrics.
arXiv Detail & Related papers (2022-03-01T17:22:31Z)
- Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z)
- Iterated learning for emergent systematicity in VQA [3.977144385787228]
Neural module networks have an architectural bias towards compositionality.
When learning layouts and modules jointly, compositionality does not arise automatically and an explicit pressure is necessary for the emergence of layouts exhibiting the right structure.
We propose to address this problem using iterated learning, a cognitive science theory of the emergence of compositional languages in nature.
arXiv Detail & Related papers (2021-05-03T18:44:06Z)
- Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)