Transformer Working Memory Enables Regular Language Reasoning and
Natural Language Length Extrapolation
- URL: http://arxiv.org/abs/2305.03796v1
- Date: Fri, 5 May 2023 18:54:40 GMT
- Title: Transformer Working Memory Enables Regular Language Reasoning and
Natural Language Length Extrapolation
- Authors: Ta-Chung Chi and Ting-Han Fan and Alexander I. Rudnicky and Peter J.
Ramadge
- Abstract summary: We propose a new Transformer variant named RegularGPT.
With its novel combination of Weight-Sharing, Adaptive-Depth, and Sliding-Dilated-Attention, RegularGPT constructs working memory along the depth dimension.
We test RegularGPT on the task of natural language length extrapolation and surprisingly find that it rediscovers the local windowed attention effect.
- Score: 72.71398034617607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional wisdom has it that, unlike recurrent models, Transformers cannot
perfectly model regular languages. Inspired by the notion of working memory, we
propose a new Transformer variant named RegularGPT. With its novel combination
of Weight-Sharing, Adaptive-Depth, and Sliding-Dilated-Attention, RegularGPT
constructs working memory along the depth dimension, thereby enabling efficient
and successful modeling of regular languages such as PARITY. We further test
RegularGPT on the task of natural language length extrapolation and
surprisingly find that it rediscovers the local windowed attention effect
deemed necessary in prior work for length extrapolation.
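The abstract does not include code, but the depth-wise working-memory idea can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the authors' implementation: a hard XOR reduction stands in for a learned, weight-shared Transformer block, dilated_mask mimics Sliding-Dilated-Attention (position i attends to i and i - 2**layer), and the number of layers adapts to roughly log2 of the sequence length, which is enough to compute PARITY.
```python
# Toy sketch (not RegularGPT code): a single shared "block" applied for an
# input-dependent number of layers, with a dilated sliding attention pattern,
# aggregates information over exponentially growing spans and computes PARITY.
import numpy as np

def dilated_mask(seq_len: int, layer: int, window: int = 2) -> np.ndarray:
    """Boolean mask: position i attends to i - k * 2**layer for k in range(window)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    dilation = 2 ** layer
    for i in range(seq_len):
        for k in range(window):
            j = i - k * dilation
            if j >= 0:
                mask[i, j] = True
    return mask

def parity_via_depth_recursion(bits: np.ndarray) -> int:
    """One shared 'block' (here just XOR) is reused at every layer; the number
    of layers adapts to ceil(log2(n)), so the last position covers the whole prefix."""
    state = bits.copy()
    n = len(state)
    layer = 0
    while 2 ** layer < n:
        mask = dilated_mask(n, layer)
        new_state = state.copy()
        for i in range(n):
            attended = np.where(mask[i])[0]
            new_state[i] = np.bitwise_xor.reduce(state[attended])
        state = new_state
        layer += 1
    return int(state[-1])  # the last position has aggregated the whole sequence

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (4, 16, 64, 257):  # includes a non-power-of-two length
        bits = rng.integers(0, 2, size=n)
        assert parity_via_depth_recursion(bits) == int(bits.sum() % 2)
    print("depth-recursive PARITY matches the ground truth")
```
The point of the sketch is only the intuition the abstract describes: weight sharing (one block reused at every depth), adaptive depth (layers grow with sequence length), and dilated sliding attention together act as a working memory along the depth dimension.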
Related papers
- Exploring the Impact of a Transformer's Latent Space Geometry on Downstream Task Performance [0.0]
We propose that much of the benefit from pre-training may be captured by geometric characteristics of the latent space representations.
We find that there is a strong linear relationship between a measure of quantized cell density and average GLUE performance.
arXiv Detail & Related papers (2024-06-18T00:17:30Z)
- RecurrentGemma: Moving Past Transformers for Efficient Open Language Models [103.59785165735727]
We introduce RecurrentGemma, a family of open language models using Google's novel Griffin architecture.
Griffin combines linear recurrences with local attention to achieve excellent performance on language.
We provide two sizes of models, containing 2B and 9B parameters, and provide pre-trained and instruction tuned variants for both.
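As a rough illustration of the "linear recurrences with local attention" combination mentioned above, here is a minimal numpy sketch; the per-feature decay, the absence of learned projections, and the window size are illustrative assumptions, and none of this is RecurrentGemma or Griffin code.
```python
# Generic sketch of the pattern: a gated linear recurrence over time followed by
# causal attention restricted to a short local window.
import numpy as np

def linear_recurrence(x: np.ndarray, decay: np.ndarray) -> np.ndarray:
    """h_t = decay * h_{t-1} + (1 - decay) * x_t, applied per feature."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out

def local_attention(x: np.ndarray, window: int = 4) -> np.ndarray:
    """Causal softmax attention where position t only sees the last `window` steps
    (queries, keys, and values are all x here; real models use learned projections)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    mask = causal & ~np.tril(np.ones((T, T), dtype=bool), -window)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(16, 8))           # (time, features)
    decay = rng.uniform(0.5, 0.99, size=8)       # learned per-feature decay in a real model
    print(local_attention(linear_recurrence(tokens, decay)).shape)  # (16, 8)
```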
arXiv Detail & Related papers (2024-04-11T15:27:22Z)
- Extracting Definienda in Mathematical Scholarly Articles with Transformers [0.0]
We consider automatically identifying the defined term within a mathematical definition from the text of an academic article.
It is possible to reach high levels of precision and recall using either the recent (and expensive) GPT-4 or simpler pre-trained models fine-tuned on our task.
arXiv Detail & Related papers (2023-11-21T08:58:57Z)
- Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require large amounts of memory and are often quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z)
- Memory Augmented Large Language Models are Computationally Universal [44.64529266193095]
We show that transformer-based large language models are computationally universal when augmented with an external memory.
We establish that an existing large language model, Flan-U-PaLM 540B, can be combined with an associative read-write memory to exactly simulate the execution of a universal Turing machine.
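To make the "associative read-write memory" idea concrete, here is a toy sketch in which a hand-written transition table stands in for the frozen language model; it is not the paper's prompting protocol, only an illustration that an external key-value memory plus a fixed step rule suffices to run a Turing machine.
```python
# Toy sketch only: an external associative (key-value) memory plus a fixed
# "processor" step function drives an unbounded computation. The transition
# table below is a 2-state busy-beaver machine; '_' marks the halting state.
from collections import defaultdict

TRANSITIONS = {  # (state, symbol) -> (write, move, next_state)
    ("A", 0): (1, +1, "B"), ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"), ("B", 1): (1, +1, "_"),
}

def run(max_steps: int = 100) -> dict:
    tape = defaultdict(int)      # associative read-write memory: address -> symbol
    state, head = "A", 0
    for _ in range(max_steps):
        if state == "_":
            break
        write, move, state = TRANSITIONS[(state, tape[head])]  # the "model call"
        tape[head] = write       # memory write
        head += move             # next address to read
    return dict(tape)

if __name__ == "__main__":
    print(run())   # four 1s on the tape: {0: 1, 1: 1, -1: 1, -2: 1}
```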
arXiv Detail & Related papers (2023-01-10T02:37:44Z)
- Lifting the Curse of Multilinguality by Pre-training Modular Transformers [72.46919537293068]
Multilingual pre-trained models suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages.
We introduce language-specific modules, which allow us to grow the total capacity of the model while keeping the number of trainable parameters per language constant.
Our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting model usage to the set of pre-trained languages.
arXiv Detail & Related papers (2022-05-12T17:59:56Z)
- Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning.
We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
We find that such pretraining enables the resulting Frozen Pretrained Transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z)
- On the Ability and Limitations of Transformers to Recognize Formal Languages [9.12267978757844]
We provide a construction of Transformers for a subclass of counter languages.
We find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction.
Perhaps surprisingly, in contrast to LSTMs, Transformers do well only on a subset of regular languages, with performance degrading as the languages become more complex.
arXiv Detail & Related papers (2020-09-23T17:21:33Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
- Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space [109.79957125584252]
Variational Autoencoder (VAE) can be both a powerful generative model and an effective representation learning framework for natural language.
In this paper, we propose the first large-scale language VAE model, Optimus.
arXiv Detail & Related papers (2020-04-05T06:20:18Z)