Stack Attention: Improving the Ability of Transformers to Model
Hierarchical Patterns
- URL: http://arxiv.org/abs/2310.01749v2
- Date: Wed, 24 Jan 2024 16:28:43 GMT
- Title: Stack Attention: Improving the Ability of Transformers to Model
Hierarchical Patterns
- Authors: Brian DuSell and David Chiang
- Abstract summary: We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision.
We show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.
- Score: 17.144569385099462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention, specifically scaled dot-product attention, has proven effective
for natural language, but it does not have a mechanism for handling
hierarchical patterns of arbitrary nesting depth, which limits its ability to
recognize certain syntactic structures. To address this shortcoming, we propose
stack attention: an attention operator that incorporates stacks, inspired by
their theoretical connections to context-free languages (CFLs). We show that
stack attention is analogous to standard attention, but with a latent model of
syntax that requires no syntactic supervision. We propose two variants: one
related to deterministic pushdown automata (PDAs) and one based on
nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs.
We show that transformers with stack attention are very effective at learning
CFLs that standard transformers struggle on, achieving strong results on a CFL
with theoretically maximal parsing difficulty. We also show that stack
attention is more effective at natural language modeling under a constrained
parameter budget, and we include results on machine translation.
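To make the mechanism concrete, here is a minimal sketch of how a differentiable stack can stand in for a self-attention sublayer, in the spirit of the deterministic (PDA-related) variant: each position emits soft push/pop/no-op probabilities, the stack is updated as a weighted superposition of the three outcomes, and the soft top of the stack becomes that position's output. The module name, projections, and stack depth below are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative superposition-stack sublayer; NOT the paper's exact parameterization.
import torch
import torch.nn as nn


class SuperpositionStackAttention(nn.Module):
    def __init__(self, d_model: int, stack_depth: int = 32):
        super().__init__()
        self.stack_depth = stack_depth
        self.action_proj = nn.Linear(d_model, 3)        # push / pop / no-op scores
        self.push_proj = nn.Linear(d_model, d_model)    # vector to push at each step
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        actions = torch.softmax(self.action_proj(x), dim=-1)   # (batch, seq, 3)
        pushed = torch.tanh(self.push_proj(x))                  # (batch, seq, d_model)

        stack = x.new_zeros(batch, self.stack_depth, d_model)   # cell 0 = top
        readings = []
        for t in range(seq_len):
            a_push, a_pop, a_noop = (actions[:, t, k].view(batch, 1, 1) for k in range(3))
            # Outcome of a push: new vector on top, everything else shifted down.
            push_stack = torch.cat([pushed[:, t].unsqueeze(1), stack[:, :-1]], dim=1)
            # Outcome of a pop: everything shifted up, empty cell at the bottom.
            pop_stack = torch.cat([stack[:, 1:], stack.new_zeros(batch, 1, d_model)], dim=1)
            # Superposition of the three outcomes, weighted by the action probabilities.
            stack = a_push * push_stack + a_pop * pop_stack + a_noop * stack
            readings.append(stack[:, 0])                 # soft top of the stack

        return self.out_proj(torch.stack(readings, dim=1))


# Drop-in usage where a self-attention sublayer would normally sit
# (residual connection and layer norm are assumed to live outside this module).
layer = SuperpositionStackAttention(d_model=64)
out = layer(torch.randn(2, 10, 64))                      # -> shape (2, 10, 64)
```

The nondeterministic variant described in the abstract replaces this single soft stack with a dynamic program over PDA runs, which is what lets the model cover arbitrary CFLs; the sketch above only illustrates the simpler, deterministic-style update.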
Related papers
- Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis [7.865191493201841]
Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning.
We propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms to reduce repetition and skipping issues.
Our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size.
arXiv Detail & Related papers (2024-06-06T19:48:17Z)
- A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
arXiv Detail & Related papers (2024-05-07T17:47:57Z)
- Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That! [13.120825574589437]
We show that Transformer-based neural machine translation (NMT) is very effective in high-resource settings.
We find that the model does not yield greater improvements for closely related than for more distant language pairs.
Our discussion of the reasons for this behaviour highlights several general challenges for low-resource NMT.
arXiv Detail & Related papers (2024-03-16T16:17:47Z)
- Exposing Attention Glitches with Flip-Flop Language Modeling [55.0688535574859]
This work identifies and analyzes the phenomenon of attention glitches in large language models.
We introduce flip-flop language modeling (FFLM), a family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models.
We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques.
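The entry is brief on what a flip-flop string looks like; as I understand the benchmark, sequences alternate instructions and bits, where "w" writes a bit to a one-bit memory, "i" is a distractor, and each "r" must be answered with the most recently written bit. The toy generator below is only a hedged illustration; the alphabet, lengths, and sampling ratios are assumptions, not the benchmark's specification.

```python
# Toy generator for flip-flop strings (illustrative assumptions, not the FFLM spec).
import random


def flip_flop_string(num_ops: int = 16, p_read: float = 0.25) -> list[str]:
    tokens, memory = [], "0"
    # Start with a write so every later read has a well-defined answer.
    ops = ["w"] + random.choices(
        ["w", "r", "i"], weights=[1 - p_read, p_read, 1.0], k=num_ops - 1)
    for op in ops:
        if op == "r":
            bit = memory                      # a read must echo the last written bit
        else:
            bit = random.choice("01")
            if op == "w":
                memory = bit                  # update the stored bit
        tokens.extend([op, bit])
    return tokens


print(" ".join(flip_flop_string()))
# e.g. "w 1 i 0 w 0 i 1 r 0 ..."  -- the model is scored on the bit after each "r"
```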
arXiv Detail & Related papers (2023-06-01T17:44:35Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic CFGs with hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
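To make "synthetic CFGs generating hierarchical sentences" concrete, here is a toy sampler over a made-up grammar with a recursive relative-clause rule; it only illustrates the kind of nesting involved and is not one of the paper's grammars.

```python
# Toy CFG sampler; the grammar is an invented example, not from the paper.
import random

# Each nonterminal maps to a list of possible right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["det", "noun", "RC"]],
    "RC": [["that", "VP"]],                  # relative clause -> recursion/nesting
    "VP": [["verb", "NP"], ["verb"]],
}


def sample(symbol: str = "S", max_depth: int = 8) -> list[str]:
    if symbol not in GRAMMAR:                # terminal symbol
        return [symbol]
    # Near the depth limit, prefer the shortest rule to cut off recursion.
    rules = GRAMMAR[symbol]
    rule = min(rules, key=len) if max_depth <= 0 else random.choice(rules)
    return [tok for sym in rule for tok in sample(sym, max_depth - 1)]


print(" ".join(sample()))
# e.g. "det noun that verb det noun verb det noun"
```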
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)