Stack Attention: Improving the Ability of Transformers to Model
Hierarchical Patterns
- URL: http://arxiv.org/abs/2310.01749v2
- Date: Wed, 24 Jan 2024 16:28:43 GMT
- Title: Stack Attention: Improving the Ability of Transformers to Model
Hierarchical Patterns
- Authors: Brian DuSell and David Chiang
- Abstract summary: We show that stack attention is analogous to standard attention, but with a latent model of syntax that requires no syntactic supervision.
We show that stack attention is more effective at natural language modeling under a constrained parameter budget, and we include results on machine translation.
- Score: 17.144569385099462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention, specifically scaled dot-product attention, has proven effective
for natural language, but it does not have a mechanism for handling
hierarchical patterns of arbitrary nesting depth, which limits its ability to
recognize certain syntactic structures. To address this shortcoming, we propose
stack attention: an attention operator that incorporates stacks, inspired by
their theoretical connections to context-free languages (CFLs). We show that
stack attention is analogous to standard attention, but with a latent model of
syntax that requires no syntactic supervision. We propose two variants: one
related to deterministic pushdown automata (PDAs) and one based on
nondeterministic PDAs, which allows transformers to recognize arbitrary CFLs.
We show that transformers with stack attention are very effective at learning
CFLs that standard transformers struggle on, achieving strong results on a CFL
with theoretically maximal parsing difficulty. We also show that stack
attention is more effective at natural language modeling under a constrained
parameter budget, and we include results on machine translation.
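To make the mechanism concrete, here is a minimal sketch of how a differentiable stack can stand in for a self-attention sublayer, in the spirit of the deterministic (PDA-related) variant: each position emits soft push/pop/no-op probabilities, the stack is updated as a weighted superposition of the three outcomes, and the soft top of the stack becomes that position's output. The module name, projections, and stack depth below are illustrative assumptions, not the authors' exact formulation.

```python
# Illustrative superposition-stack sublayer; NOT the paper's exact parameterization.
import torch
import torch.nn as nn


class SuperpositionStackAttention(nn.Module):
    def __init__(self, d_model: int, stack_depth: int = 32):
        super().__init__()
        self.stack_depth = stack_depth
        self.action_proj = nn.Linear(d_model, 3)        # push / pop / no-op scores
        self.push_proj = nn.Linear(d_model, d_model)    # vector to push at each step
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        actions = torch.softmax(self.action_proj(x), dim=-1)   # (batch, seq, 3)
        pushed = torch.tanh(self.push_proj(x))                  # (batch, seq, d_model)

        stack = x.new_zeros(batch, self.stack_depth, d_model)   # cell 0 = top
        readings = []
        for t in range(seq_len):
            a_push, a_pop, a_noop = (actions[:, t, k].view(batch, 1, 1) for k in range(3))
            # Outcome of a push: new vector on top, everything else shifted down.
            push_stack = torch.cat([pushed[:, t].unsqueeze(1), stack[:, :-1]], dim=1)
            # Outcome of a pop: everything shifted up, empty cell at the bottom.
            pop_stack = torch.cat([stack[:, 1:], stack.new_zeros(batch, 1, d_model)], dim=1)
            # Superposition of the three outcomes, weighted by the action probabilities.
            stack = a_push * push_stack + a_pop * pop_stack + a_noop * stack
            readings.append(stack[:, 0])                 # soft top of the stack

        return self.out_proj(torch.stack(readings, dim=1))


# Drop-in usage where a self-attention sublayer would normally sit
# (residual connection and layer norm are assumed to live outside this module).
layer = SuperpositionStackAttention(d_model=64)
out = layer(torch.randn(2, 10, 64))                      # -> shape (2, 10, 64)
```

The nondeterministic variant described in the abstract replaces this single soft stack with a dynamic program over PDA runs, which is what lets the model cover arbitrary CFLs; the sketch above only illustrates the simpler, deterministic-style update.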
Related papers
- Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis [7.865191493201841]
Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning.
We propose to replace transformers with emerging recurrent architectures and introduce specialized cross-attention mechanisms to reduce repetition and skipping issues.
Our architecture can be efficiently trained on long samples and achieve state-of-the-art zero-shot voice cloning against baselines of comparable size.
arXiv Detail & Related papers (2024-06-06T19:48:17Z)
- A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
arXiv Detail & Related papers (2024-05-07T17:47:57Z)
- Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That! [13.120825574589437]
We show that Transformer-based neural machine translation (NMT) is very effective in high-resource settings.
We find that the model does not yield greater improvements for closely related than for more distant language pairs.
Our discussion of the reasons for this behaviour highlights several general challenges for low-resource NMT.
arXiv Detail & Related papers (2024-03-16T16:17:47Z)
- Exposing Attention Glitches with Flip-Flop Language Modeling [55.0688535574859]
This work identifies and analyzes the phenomenon of attention glitches in large language models.
We introduce flip-flop language modeling (FFLM), a family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models.
We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques.
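The entry is brief on what a flip-flop string looks like; as I understand the benchmark, sequences alternate instructions and bits, where "w" writes a bit to a one-bit memory, "i" is a distractor, and each "r" must be answered with the most recently written bit. The toy generator below is only a hedged illustration; the alphabet, lengths, and sampling ratios are assumptions, not the benchmark's specification.

```python
# Toy generator for flip-flop strings (illustrative assumptions, not the FFLM spec).
import random


def flip_flop_string(num_ops: int = 16, p_read: float = 0.25) -> list[str]:
    tokens, memory = [], "0"
    # Start with a write so every later read has a well-defined answer.
    ops = ["w"] + random.choices(
        ["w", "r", "i"], weights=[1 - p_read, p_read, 1.0], k=num_ops - 1)
    for op in ops:
        if op == "r":
            bit = memory                      # a read must echo the last written bit
        else:
            bit = random.choice("01")
            if op == "w":
                memory = bit                  # update the stored bit
        tokens.extend([op, bit])
    return tokens


print(" ".join(flip_flop_string()))
# e.g. "w 1 i 0 w 0 i 1 r 0 ..."  -- the model is scored on the bit after each "r"
```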
arXiv Detail & Related papers (2023-06-01T17:44:35Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic CFGs with hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
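To make "synthetic CFGs generating hierarchical sentences" concrete, here is a toy sampler over a made-up grammar with a recursive relative-clause rule; it only illustrates the kind of nesting involved and is not one of the paper's grammars.

```python
# Toy CFG sampler; the grammar is an invented example, not from the paper.
import random

# Each nonterminal maps to a list of possible right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["det", "noun", "RC"]],
    "RC": [["that", "VP"]],                  # relative clause -> recursion/nesting
    "VP": [["verb", "NP"], ["verb"]],
}


def sample(symbol: str = "S", max_depth: int = 8) -> list[str]:
    if symbol not in GRAMMAR:                # terminal symbol
        return [symbol]
    # Near the depth limit, prefer the shortest rule to cut off recursion.
    rules = GRAMMAR[symbol]
    rule = min(rules, key=len) if max_depth <= 0 else random.choice(rules)
    return [tok for sym in rule for tok in sample(sym, max_depth - 1)]


print(" ".join(sample()))
# e.g. "det noun that verb det noun verb det noun"
```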
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction.
arXiv Detail & Related papers (2022-10-11T18:11:37Z)
- Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)