Regularizing Transformers With Deep Probabilistic Layers
- URL: http://arxiv.org/abs/2108.10764v1
- Date: Mon, 23 Aug 2021 10:17:02 GMT
- Title: Regularizing Transformers With Deep Probabilistic Layers
- Authors: Aurora Cobo Aguilera, Pablo Martínez Olmos, Antonio Artés-Rodríguez, Fernando Pérez-Cruz
- Abstract summary: In this work, we demonstrate how the inclusion of deep generative models within BERT can bring more versatile models.
We prove its effectiveness not only in Transformers but also in the most relevant encoder-decoder-based LMs: seq2seq with and without attention.
- Score: 62.997667081978825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) have grown non-stop over the last decade, from
sequence-to-sequence architectures to the state-of-the-art, fully
attention-based Transformers. In this work, we demonstrate how the inclusion of
deep generative models within BERT can bring more versatile models, able to
impute missing/noisy words with richer text or even improve BLEU score. More
precisely, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a
regularizer layer and prove its effectiveness not only in Transformers but also
in the most relevant encoder-decoder-based LMs: seq2seq with and without
attention.
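The abstract describes the GMVAE regularizer only at a high level. The sketch below is one way such a layer could be attached to the hidden states of an intermediate Transformer/BERT layer, with the reconstruction and KL terms added to the language-model loss; the class name, latent sizes, and loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of a GMVAE used as a regularizer on
# an intermediate Transformer layer. Names such as GMVAERegularizer, d_latent
# and n_components are assumptions for illustration.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GMVAERegularizer(nn.Module):
    """Encodes hidden states into a K-component Gaussian-mixture latent space
    and reconstructs them; the resulting ELBO terms act as an auxiliary
    regularization loss for the language model."""

    def __init__(self, d_model: int, d_latent: int = 64, n_components: int = 10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        self.mu = nn.Linear(d_model, d_latent)
        self.logvar = nn.Linear(d_model, d_latent)
        self.comp_logits = nn.Linear(d_model, n_components)   # q(c | h)
        self.dec = nn.Sequential(nn.Linear(d_latent, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        # One unit-variance Gaussian prior p(z | c) per mixture component.
        self.prior_mu = nn.Parameter(torch.randn(n_components, d_latent))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: hidden states of an intermediate layer, shape (batch, seq, d_model)
        e = self.enc(h)
        mu, logvar = self.mu(e), self.logvar(e)
        qc = F.softmax(self.comp_logits(e), dim=-1)              # (batch, seq, K)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.dec(z)

        # Reconstruction of the hidden states.
        rec = F.mse_loss(recon, h)
        # KL(q(z|h) || p(z|c)) per component, weighted by q(c|h).
        diff = mu.unsqueeze(-2) - self.prior_mu                  # (batch, seq, K, d_latent)
        kl_z = 0.5 * (torch.exp(logvar).unsqueeze(-2) + diff ** 2
                      - 1.0 - logvar.unsqueeze(-2)).sum(-1)      # (batch, seq, K)
        # KL(q(c|h) || Uniform(K)) for the mixture assignments.
        kl_c = (qc * (qc.clamp_min(1e-8).log() + math.log(qc.size(-1)))).sum(-1)
        return rec + (qc * kl_z).sum(-1).mean() + kl_c.mean()
```

In training, the returned value would be added to the usual LM objective with a small weight (e.g. `loss = lm_loss + reg_weight * gmvae(hidden_states)`, where `reg_weight` is a hypothetical hyperparameter); at inference the layer can simply be bypassed.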
Related papers
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- GIVT: Generative Infinite-Vocabulary Transformers [18.55070896912795]
We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate vector sequences with real-valued entries.
Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, we use GIVT to model the unquantized real-valued latent sequences of a β-VAE.
In class-conditional image generation GIVT outperforms VQ-GAN as well as MaskGIT, and achieves performance competitive with recent latent diffusion models.
arXiv Detail & Related papers (2023-12-04T18:48:02Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
- Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers [7.76925617801895]
We show that VAE-style autoencoders can be constructed using invertible layers, which offer a tractable exact likelihood without the need for regularization terms.
This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures.
We show that the approach results in strikingly higher performance than architecturally equivalent VAEs in terms of log-likelihood, sample quality, and denoising performance.
arXiv Detail & Related papers (2022-05-19T13:16:09Z)
- Hierarchical RNNs-Based Transformers MADDPG for Mixed Cooperative-Competitive Environments [1.9241821314180374]
This paper proposes a hierarchical Transformers MADDPG based on RNNs, which we call Hierarchical RNNs-Based Transformers MADDPG (HRTMADDPG).
It consists of a lower-level RNN encoder that encodes multiple step sizes within each time sequence, and an upper, sequence-level Transformer encoder that learns the correlations between multiple sequences.
arXiv Detail & Related papers (2021-05-11T09:22:52Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
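The segment-aware mechanism in the Segatron entry above is only named, not described. In the paper it augments Transformer-XL's position information with sentence- and paragraph-level indices in addition to token positions; the sketch below is a simplified, hedged illustration of that idea using absolute learned embeddings rather than Transformer-XL's relative scheme, and all names and dimension limits are assumptions.

```python
# Simplified illustration of segment-aware positions: token, sentence and
# paragraph indices each get their own embedding table and are summed.
# (Segatron itself extends Transformer-XL's relative position encoding;
# this absolute-embedding variant is only a sketch of the idea.)
import torch
import torch.nn as nn


class SegmentAwarePositions(nn.Module):
    def __init__(self, d_model: int, max_tok: int = 512,
                 max_sent: int = 128, max_para: int = 64):
        super().__init__()
        self.tok = nn.Embedding(max_tok, d_model)
        self.sent = nn.Embedding(max_sent, d_model)
        self.para = nn.Embedding(max_para, d_model)

    def forward(self, tok_idx, sent_idx, para_idx):
        # Each index tensor has shape (batch, seq) and holds the position of a
        # token within its sentence, of the sentence within its paragraph, and
        # of the paragraph within the document, respectively.
        return self.tok(tok_idx) + self.sent(sent_idx) + self.para(para_idx)
```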