Multi-scale Transformer Language Models
- URL: http://arxiv.org/abs/2005.00581v1
- Date: Fri, 1 May 2020 19:58:56 GMT
- Title: Multi-scale Transformer Language Models
- Authors: Sandeep Subramanian, Ronan Collobert, Marc'Aurelio Ranzato, Y-Lan Boureau
- Abstract summary: We investigate multi-scale transformer language models that learn representations of text at multiple scales.
We present three different architectures that have an inductive bias to handle the hierarchical nature of language.
- Score: 30.201934597815583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate multi-scale transformer language models that learn
representations of text at multiple scales, and present three different
architectures that have an inductive bias to handle the hierarchical nature of
language. Experiments on large-scale language modeling benchmarks empirically
demonstrate favorable likelihood vs memory footprint trade-offs, e.g. we show
that it is possible to train a hierarchical variant with 30 layers that has 23%
smaller memory footprint and better perplexity, compared to a vanilla
transformer with less than half the number of layers, on the Toronto
BookCorpus. We analyze the advantages of learned representations at multiple
scales in terms of memory footprint, compute time, and perplexity, which are
particularly appealing given the quadratic scaling of transformers' run time
and memory usage with respect to sequence length.
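To make the quadratic-scaling point concrete, here is a toy PyTorch sketch (not the paper's architecture; the pooling factor and the way the two scales are combined are assumptions) of why attending over a pooled, coarser sequence shrinks the attention score matrix:

```python
# Toy sketch (not the paper's exact model): attention cost is quadratic in
# sequence length, so attending over a pooled, coarser sequence is cheaper.
# Causal masking is omitted for brevity.
import torch
import torch.nn.functional as F

n, d, k = 1024, 64, 4            # tokens, model dim, assumed pooling factor
x = torch.randn(1, n, d)

# Fine scale: full self-attention materializes an n x n score matrix.
scores_fine = x @ x.transpose(1, 2) / d ** 0.5                  # (1, n, n)
fine = F.softmax(scores_fine, dim=-1) @ x

# Coarse scale: average-pool tokens by k, then attend over n/k positions,
# shrinking the score matrix (and its memory) by a factor of k^2.
x_coarse = F.avg_pool1d(x.transpose(1, 2), kernel_size=k).transpose(1, 2)
scores_coarse = x_coarse @ x_coarse.transpose(1, 2) / d ** 0.5  # (1, n/k, n/k)
coarse = F.softmax(scores_coarse, dim=-1) @ x_coarse

# Upsample the coarse summary back to token resolution and combine scales.
coarse_up = coarse.repeat_interleave(k, dim=1)                  # (1, n, d)
multi_scale = fine + coarse_up

print(scores_fine.shape, scores_coarse.shape, multi_scale.shape)
```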
Related papers
- How Redundant Is the Transformer Stack in Speech Representation Models? [1.3873323883842132]
Self-supervised speech representation models have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection.
Recent studies of transformer models have revealed high redundancy between layers and the potential for significant pruning.
We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training.
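As a purely hypothetical illustration of how inter-layer redundancy might be exploited (the similarity criterion and threshold below are assumptions, not the paper's method), one could keep only the layers whose outputs differ noticeably from their inputs:

```python
# Hypothetical layer-pruning sketch: treat a layer as redundant if its output
# is nearly identical (by cosine similarity) to its input.
import torch
import torch.nn as nn

layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(12)
)
x = torch.randn(8, 100, 256)      # (batch, frames, features), dummy input

kept = []
h = x
with torch.no_grad():
    for i, layer in enumerate(layers):
        out = layer(h)
        sim = torch.cosine_similarity(out.flatten(1), h.flatten(1)).mean()
        if sim < 0.95:            # threshold is an assumption
            kept.append(i)        # layer changes the representation: keep it
        h = out

print("layers kept:", kept)
```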
arXiv Detail & Related papers (2024-09-10T11:00:24Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z)
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning [115.50132185963139]
CM3Leon is a decoder-only multi-modal language model capable of generating and infilling both text and images.
It is the first multi-modal model trained with a recipe adapted from text-only language models.
CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods.
arXiv Detail & Related papers (2023-09-05T21:27:27Z)
- Learning Multiscale Transformer Models for Sequence Generation [33.73729074207944]
We build a multiscale Transformer model by establishing relationships among scales based on word-boundary information and phrase-level prior knowledge.
Notably, it yielded consistent performance gains over the strong baseline on several test sets without sacrificing efficiency.
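A hedged sketch of the word-boundary idea (the scatter-mean pooling below is an assumption, not the paper's exact formulation): subword states can be pooled into word-level states using a subword-to-word index, yielding a coarser scale for the model to attend over.

```python
# Sketch: pool subword representations into word-level representations using
# word-boundary information (the subword-to-word index is an assumed input).
import torch

subword_states = torch.randn(1, 7, 16)               # (batch, subwords, dim)
word_ids = torch.tensor([[0, 0, 1, 2, 2, 2, 3]])     # subword -> word index
num_words = int(word_ids.max()) + 1

# Scatter-mean: sum subword vectors per word, then divide by counts.
word_states = torch.zeros(1, num_words, 16)
word_states.index_add_(1, word_ids[0], subword_states)
counts = torch.bincount(word_ids[0], minlength=num_words).view(1, -1, 1)
word_states = word_states / counts

print(word_states.shape)                              # (1, 4, 16)
```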
arXiv Detail & Related papers (2022-06-19T07:28:54Z)
- Improving language models by retrieving from trillions of tokens [50.42630445476544]
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus.
With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains performance comparable to GPT-3 and Jurassic-1 on the Pile.
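A minimal sketch of the retrieval-conditioning idea (the mean-pooled query, nearest-neighbour lookup, and single cross-attention below are simplifications, not RETRO's chunked cross-attention):

```python
# Simplified sketch of retrieval-augmented conditioning: embed the current
# context, fetch the most similar chunks from a corpus index, and let the
# model cross-attend to them.
import torch
import torch.nn as nn

d = 64
corpus_chunks = torch.randn(10_000, d)       # precomputed chunk embeddings
context = torch.randn(1, 32, d)              # current input representations

# Retrieve the k most similar chunks to the mean context embedding.
query = context.mean(dim=1)                  # (1, d)
sims = query @ corpus_chunks.T               # (1, 10_000)
top_idx = sims.topk(k=4, dim=-1).indices     # (1, 4)
retrieved = corpus_chunks[top_idx]           # (1, 4, d)

# Condition on the retrieved chunks via cross-attention.
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
conditioned, _ = cross_attn(query=context, key=retrieved, value=retrieved)
print(conditioned.shape)                     # (1, 32, 64)
```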
arXiv Detail & Related papers (2021-12-08T17:32:34Z)
- Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences, which allows them to produce long, coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
arXiv Detail & Related papers (2021-10-26T14:00:49Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
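The two branches can be sketched roughly as follows (window size, projection rank, and the way the branches are merged are assumptions; this is not the Transformer-LS implementation):

```python
# Sketch of combining short-term (windowed) attention with long-range
# attention over projected keys/values (sizes and the merge are assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

n, d, w, r = 256, 64, 16, 32      # tokens, dim, local window, projected length
x = torch.randn(1, n, d)

# Short-term branch: each window of w tokens attends only within itself.
windows = x.view(1, n // w, w, d)
local_scores = windows @ windows.transpose(-1, -2) / d ** 0.5
local_out = (F.softmax(local_scores, dim=-1) @ windows).view(1, n, d)

# Long-range branch: project the n keys/values down to r summary positions,
# then attend over those (cost n*r instead of n*n). A static linear layer
# stands in for the paper's data-dependent "dynamic projection".
proj = nn.Linear(n, r, bias=False)
kv = proj(x.transpose(1, 2)).transpose(1, 2)           # (1, r, d)
global_scores = x @ kv.transpose(-1, -2) / d ** 0.5    # (1, n, r)
global_out = F.softmax(global_scores, dim=-1) @ kv

# Merge the two scales (Transformer-LS aggregates them differently).
out = local_out + global_out
print(out.shape)
```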
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- SML: a new Semantic Embedding Alignment Transformer for efficient cross-lingual Natural Language Inference [71.57324258813674]
The ability of Transformers to perform a variety of tasks such as question answering, Natural Language Inference (NLI), and summarization with precision has established them as one of the best paradigms currently available for addressing such tasks.
NLI is one of the best scenarios for testing these architectures, given the knowledge required to understand complex sentences and to establish a relation between a hypothesis and a premise.
In this paper, we propose a new architecture, siamese multilingual transformer, to efficiently align multilingual embeddings for Natural Language Inference.
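A rough sketch of a siamese (weight-shared) encoder for NLI (the encoder size, pooling, and feature combination are assumptions, not necessarily the paper's choices):

```python
# Sketch of a siamese encoder for NLI: premise and hypothesis pass through
# the same encoder, and pooled embeddings are combined for classification.
import torch
import torch.nn as nn

d, vocab = 128, 30_000
embed = nn.Embedding(vocab, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)
classifier = nn.Linear(4 * d, 3)      # entailment / neutral / contradiction

def encode(token_ids):
    # Shared weights: the same embed/encoder objects serve both inputs.
    return encoder(embed(token_ids)).mean(dim=1)   # mean-pooled sentence vector

premise = torch.randint(0, vocab, (2, 20))
hypothesis = torch.randint(0, vocab, (2, 15))
u, v = encode(premise), encode(hypothesis)
features = torch.cat([u, v, torch.abs(u - v), u * v], dim=-1)
logits = classifier(features)
print(logits.shape)                   # (2, 3)
```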
arXiv Detail & Related papers (2021-03-17T13:23:53Z)
- Memory Transformer [0.31406146587437894]
Transformer-based models have achieved state-of-the-art results in many natural language processing tasks.
Memory-augmented neural networks (MANNs) extend traditional neural architectures with general-purpose memory for representations.
We evaluate these memory-augmented Transformers and demonstrate that the presence of memory positively correlates with model performance.
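The memory-token idea can be sketched by prepending a few learned vectors to the token sequence before self-attention (the number of memory slots and the readout below are assumptions, not the paper's settings):

```python
# Sketch of memory-augmented self-attention: learned memory tokens are
# prepended to the input so every token can attend to them (and vice versa).
import torch
import torch.nn as nn

d, n_mem = 128, 10
memory = nn.Parameter(torch.randn(1, n_mem, d))        # learned memory slots
layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)

tokens = torch.randn(2, 50, d)                         # (batch, seq, dim)
mem = memory.expand(tokens.size(0), -1, -1)            # share memory across batch
augmented = torch.cat([mem, tokens], dim=1)            # (2, 10 + 50, d)

out = layer(augmented)
mem_out, token_out = out[:, :n_mem], out[:, n_mem:]    # split memory vs tokens
print(mem_out.shape, token_out.shape)
```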
arXiv Detail & Related papers (2020-06-20T09:06:27Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
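A hedged sketch of the segment-aware position encoding described above (granularities and embedding sizes are assumptions): besides its within-sentence position, each token also receives sentence- and paragraph-index embeddings, and the three are summed.

```python
# Sketch of segment-aware positional encoding: sum token-, sentence-, and
# paragraph-position embeddings (sizes and maximum indices are assumptions).
import torch
import torch.nn as nn

d = 128
tok_pos_emb = nn.Embedding(512, d)    # position of token within its sentence
sent_pos_emb = nn.Embedding(64, d)    # index of sentence within its paragraph
para_pos_emb = nn.Embedding(32, d)    # index of paragraph within the document

# Example: 6 tokens spanning two sentences of the same paragraph.
tok_pos = torch.tensor([[0, 1, 2, 0, 1, 2]])
sent_pos = torch.tensor([[0, 0, 0, 1, 1, 1]])
para_pos = torch.tensor([[0, 0, 0, 0, 0, 0]])

segment_aware_pos = (
    tok_pos_emb(tok_pos) + sent_pos_emb(sent_pos) + para_pos_emb(para_pos)
)
token_embeddings = torch.randn(1, 6, d)
hidden = token_embeddings + segment_aware_pos          # fed to the Transformer
print(hidden.shape)                                    # (1, 6, 128)
```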
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.