Pushdown Layers: Encoding Recursive Structure in Transformer Language Models
- URL: http://arxiv.org/abs/2310.19089v1
- Date: Sun, 29 Oct 2023 17:27:18 GMT
- Title: Pushdown Layers: Encoding Recursive Structure in Transformer Language Models
- Authors: Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning
- Abstract summary: Recursion is a prominent feature of human language, and fundamentally challenging for self-attention.
This work introduces Pushdown Layers, a new self-attention layer.
Transformers equipped with Pushdown Layers achieve dramatically better and 3-5x more sample-efficient syntactic generalization.
- Score: 86.75729087623259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recursion is a prominent feature of human language, and fundamentally
challenging for self-attention due to the lack of an explicit recursive-state
tracking mechanism. Consequently, Transformer language models poorly capture
long-tail recursive structure and exhibit sample-inefficient syntactic
generalization. This work introduces Pushdown Layers, a new self-attention
layer that models recursive state via a stack tape that tracks estimated depths
of every token in an incremental parse of the observed prefix. Transformer LMs
with Pushdown Layers are syntactic language models that autoregressively and
synchronously update this stack tape as they predict new tokens, in turn using
the stack tape to softly modulate attention over tokens -- for instance,
learning to "skip" over closed constituents. When trained on a corpus of
strings annotated with silver constituency parses, Transformers equipped with
Pushdown Layers achieve dramatically better and 3-5x more sample-efficient
syntactic generalization, while maintaining similar perplexities. Pushdown
Layers are a drop-in replacement for standard self-attention. We illustrate
this by finetuning GPT2-medium with Pushdown Layers on an automatically parsed
WikiText-103, leading to improvements on several GLUE text classification
tasks.
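To make the mechanism concrete, here is a minimal sketch, assuming the per-token stack-tape depths are already available and are turned into a per-head additive attention bias. This is an illustration of the idea rather than the paper's implementation (the paper predicts stack-tape updates synchronously with token prediction); the class name `PushdownStyleAttention` and components such as `depth_emb` are hypothetical.

```python
# Minimal illustrative sketch (not the authors' implementation): a causal
# self-attention layer whose scores are softly modulated by each token's
# estimated stack depth, in the spirit of Pushdown Layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PushdownStyleAttention(nn.Module):  # hypothetical name
    def __init__(self, d_model: int, n_heads: int, max_depth: int = 32):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Learned per-head bias for each possible depth on the stack tape.
        self.depth_emb = nn.Embedding(max_depth, n_heads)

    def forward(self, x: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); depths: (batch, seq) integer depth of each
        # token in an incremental parse of the observed prefix.
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):  # (b, t, d_model) -> (b, heads, t, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, h, t, t)

        # Additive bias based on the attended (key) token's depth: heads can
        # learn to down-weight tokens buried inside closed constituents.
        depths = depths.clamp(max=self.depth_emb.num_embeddings - 1)
        bias = self.depth_emb(depths).permute(0, 2, 1).unsqueeze(2)  # (b, h, 1, t)
        scores = scores + bias

        # Causal mask: each position attends only to the observed prefix.
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))

        attn = F.softmax(scores, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, t, -1))
```

A full syntactic language model along these lines would additionally predict, at each step, how the newly generated token updates the stack tape (e.g., which constituents it closes), training those updates against silver constituency parses as described in the abstract.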
Related papers
- Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech [9.982121768809854]
We introduce enhancements aimed at AR Transformer-based encoder-decoder text-to-speech systems.
Our approach uses an alignment mechanism to provide cross-attention operations with relative location information.
A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system.
arXiv Detail & Related papers (2024-10-29T16:17:01Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Improving language models by retrieving from trillions of tokens [50.42630445476544]
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus.
With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile.
arXiv Detail & Related papers (2021-12-08T17:32:34Z)
- Semantic Parsing in Task-Oriented Dialog with Recursive Insertion-based Encoder [6.507504084891086]
We introduce a Recursive INsertion-based entity recognition (RINE) approach for semantic parsing in task-oriented dialog.
RINE achieves state-of-the-art exact match accuracy on low- and high-resource versions of the conversational semantic parsing benchmark TOP.
Our approach is 2-3.5 times faster than the sequence-to-sequence model at inference time.
arXiv Detail & Related papers (2021-09-09T18:23:45Z)
- SIT3: Code Summarization with Structure-Induced Transformer [48.000063280183376]
We propose a novel model based on structure-induced self-attention, which encodes sequential inputs with highly-effective structure modeling.
Our newly-proposed model achieves new state-of-the-art results on popular benchmarks.
arXiv Detail & Related papers (2020-12-29T11:37:43Z)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
- Neural Syntactic Preordering for Controlled Paraphrase Generation [57.5316011554622]
Our work uses syntactic transformations to softly "reorder" the source sentence and guide our neural paraphrasing model.
First, given an input sentence, we derive a set of feasible syntactic rearrangements using an encoder-decoder model.
Next, we use each proposed rearrangement to produce a sequence of position embeddings, which encourages our final encoder-decoder paraphrase model to attend to the source words in a particular order.
arXiv Detail & Related papers (2020-05-05T09:02:25Z)
- Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
- Stacked DeBERT: All Attention in Incomplete Data for Text Classification [8.900866276512364]
We propose Stacked DeBERT, short for Stacked Denoising Bidirectional Representations from Transformers.
Our model shows improved F1-scores and better robustness on informal/incorrect text from tweets and on text containing speech-to-text errors, in sentiment and intent classification tasks.
arXiv Detail & Related papers (2020-01-01T04:49:23Z)