Stack Attention: Improving the Ability of Transformers to Model
Hierarchical Patterns
- URL: http://arxiv.org/abs/2310.01749v2
- Date: Wed, 24 Jan 2024 16:28:43 GMT
- Title: Stack Attention: Improving the Ability of Transformers to Model
Hierarchical Patterns
- Authors: Brian DuSell and David Chiang
- Abstract summary: We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision.
We show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.
- Score: 17.144569385099462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention, specifically scaled dot-product attention, has proven effective
for natural language, but it does not have a mechanism for handling
hierarchical patterns of arbitrary nesting depth, which limits its ability to
recognize certain syntactic structures. To address this shortcoming, we propose
stack attention: an attention operator that incorporates stacks, inspired by
their theoretical connections to context-free languages (CFLs). We show that
stack attention is analogous to standard attention, but with a latent model of
syntax that requires no syntactic supervision. We propose two variants: one
related to deterministic pushdown automata (PDAs) and one based on
nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs.
We show that transformers with stack attention are very effective at learning
CFLs that standard transformers struggle on, achieving strong results on a CFL
with theoretically maximal parsing difficulty. We also show that stack
attention is more effective at natural language modeling under a constrained
parameter budget, and we include results on machine translation.
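As a rough illustration of the machinery involved, here is a minimal sketch of a differentiable "superposition" stack, in which push, pop, and no-op actions are applied as a soft, weighted combination at every position. It is written in PyTorch with illustrative names (SuperpositionStack and its parameters are assumptions of this sketch), and it simplifies the paper's operator; in particular, the nondeterministic-PDA variant is considerably more involved than what is shown here.

```python
import torch
import torch.nn as nn

class SuperpositionStack(nn.Module):
    """Minimal differentiable stack: at each timestep the controller emits a
    soft distribution over {push, pop, no-op}, and the stack contents become a
    convex combination of the three resulting stacks. Illustrative sketch only;
    not the exact stack attention operator of the paper."""

    def __init__(self, d_model):
        super().__init__()
        self.action = nn.Linear(d_model, 3)       # logits for push / pop / no-op
        self.value = nn.Linear(d_model, d_model)  # vector to push

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, d = x.shape
        # stack[:, 0] is the (soft) top; depth is bounded by seq_len + 1 here
        stack = x.new_zeros(batch, seq_len + 1, d)
        readings = []
        for t in range(seq_len):
            a = torch.softmax(self.action(x[:, t]), dim=-1)   # (batch, 3)
            v = self.value(x[:, t])                           # (batch, d)
            pushed = torch.cat([v.unsqueeze(1), stack[:, :-1]], dim=1)
            popped = torch.cat([stack[:, 1:], stack.new_zeros(batch, 1, d)], dim=1)
            stack = (a[:, 0, None, None] * pushed
                     + a[:, 1, None, None] * popped
                     + a[:, 2, None, None] * stack)
            readings.append(stack[:, 0])  # the reading is the soft top of the stack
        return torch.stack(readings, dim=1)  # (batch, seq_len, d_model)

# Example usage (shapes only): x could be the output of a transformer sublayer.
layer = SuperpositionStack(d_model=64)
out = layer(torch.randn(2, 10, 64))  # (2, 10, 64)
```

In a stack attention sublayer, the per-position stack reading would play roughly the role that the weighted sum over values plays in standard scaled dot-product attention.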
Related papers
- Tensor Product Attention Is All You Need [54.40495407154611]
Tensor Product Attention (TPA) is a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly (a rough sketch of this factorization idea appears after this list).
TPA achieves improved model quality alongside memory efficiency.
We introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling.
arXiv Detail & Related papers (2025-01-11T03:37:10Z)
- Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models [49.84163262868945]
Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling.
The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers.
We propose parallel context encoding, which splits the context into sub-pieces and encodes them in parallel.
arXiv Detail & Related papers (2024-12-21T09:04:51Z)
- Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis [7.865191493201841]
Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning.
We propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms to reduce repetition and skipping issues.
Our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size.
arXiv Detail & Related papers (2024-06-06T19:48:17Z)
- A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
arXiv Detail & Related papers (2024-05-07T17:47:57Z)
- Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That! [13.120825574589437]
We show that Transformer-based neural machine translation (NMT) is very effective in high-resource settings.
We show that the model does not show greater improvements for closely-related vs. more distant language pairs.
Our discussion of the reasons for this behaviour highlights several general challenges for low-resource NMT.
arXiv Detail & Related papers (2024-03-16T16:17:47Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic context-free grammars (CFGs) with hierarchical rules that can generate lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer:Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM models.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
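As referenced in the Tensor Product Attention entry above, the following is a hypothetical sketch of the general idea of factorizing per-token queries, keys, or values as a sum of rank-1 outer products between a per-head factor and a per-dimension factor. The class name, shapes, and normalization are assumptions of this sketch, not the T6 implementation.

```python
import torch
import torch.nn as nn

class TensorProductQKV(nn.Module):
    """Hypothetical sketch: build a (n_heads x head_dim) projection of each
    token as a rank-R sum of outer products between per-head and per-dim
    factors, instead of a single full-rank linear projection."""

    def __init__(self, d_model, n_heads, head_dim, rank):
        super().__init__()
        self.n_heads, self.head_dim, self.rank = n_heads, head_dim, rank
        self.head_factors = nn.Linear(d_model, rank * n_heads)
        self.dim_factors = nn.Linear(d_model, rank * head_dim)

    def forward(self, x):
        # x: (batch, seq, d_model)
        b, s, _ = x.shape
        a = self.head_factors(x).view(b, s, self.rank, self.n_heads)
        c = self.dim_factors(x).view(b, s, self.rank, self.head_dim)
        # Sum of rank-1 outer products -> (batch, seq, n_heads, head_dim);
        # dividing by rank is an arbitrary normalization choice in this sketch.
        return torch.einsum('bsrh,bsrd->bshd', a, c) / self.rank
```

A module like this would be instantiated separately for queries, keys, and values; in a setup of this kind, memory savings could come from caching only the small factor tensors rather than full per-head keys and values.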
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.