Regularizing Transformers With Deep Probabilistic Layers
- URL: http://arxiv.org/abs/2108.10764v1
- Date: Mon, 23 Aug 2021 10:17:02 GMT
- Title: Regularizing Transformers With Deep Probabilistic Layers
- Authors: Aurora Cobo Aguilera, Pablo Martínez Olmos, Antonio Artés-Rodríguez, Fernando Pérez-Cruz
- Abstract summary: In this work, we demonstrate how the inclusion of deep generative models within BERT can bring more versatile models.
We prove its effectiveness not only in Transformers but also in the most relevant encoder-decoder-based LMs: seq2seq with and without attention.
- Score: 62.997667081978825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) have grown non-stop over the last decade, from
sequence-to-sequence architectures to the state-of-the-art, fully
attention-based Transformers. In this work, we demonstrate how the inclusion of
deep generative models within BERT can bring more versatile models, able to
impute missing/noisy words with richer text or even improve BLEU score. More
precisely, we use a Gaussian Mixture Variational Autoencoder (GMVAE) as a
regularizer layer and prove its effectiveness not only in Transformers but also
in the most relevant encoder-decoder-based LMs: seq2seq with and without
attention.
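The abstract describes the GMVAE regularizer only at a high level. The sketch below is one way such a layer could be attached to the hidden states of an intermediate Transformer/BERT layer, with the reconstruction and KL terms added to the language-model loss; the class name, latent sizes, and loss weighting are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of a GMVAE used as a regularizer on
# an intermediate Transformer layer. Names such as GMVAERegularizer, d_latent
# and n_components are assumptions for illustration.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GMVAERegularizer(nn.Module):
    """Encodes hidden states into a K-component Gaussian-mixture latent space
    and reconstructs them; the resulting ELBO terms act as an auxiliary
    regularization loss for the language model."""

    def __init__(self, d_model: int, d_latent: int = 64, n_components: int = 10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        self.mu = nn.Linear(d_model, d_latent)
        self.logvar = nn.Linear(d_model, d_latent)
        self.comp_logits = nn.Linear(d_model, n_components)   # q(c | h)
        self.dec = nn.Sequential(nn.Linear(d_latent, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        # One unit-variance Gaussian prior p(z | c) per mixture component.
        self.prior_mu = nn.Parameter(torch.randn(n_components, d_latent))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: hidden states of an intermediate layer, shape (batch, seq, d_model)
        e = self.enc(h)
        mu, logvar = self.mu(e), self.logvar(e)
        qc = F.softmax(self.comp_logits(e), dim=-1)              # (batch, seq, K)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        recon = self.dec(z)

        # Reconstruction of the hidden states.
        rec = F.mse_loss(recon, h)
        # KL(q(z|h) || p(z|c)) per component, weighted by q(c|h).
        diff = mu.unsqueeze(-2) - self.prior_mu                  # (batch, seq, K, d_latent)
        kl_z = 0.5 * (torch.exp(logvar).unsqueeze(-2) + diff ** 2
                      - 1.0 - logvar.unsqueeze(-2)).sum(-1)      # (batch, seq, K)
        # KL(q(c|h) || Uniform(K)) for the mixture assignments.
        kl_c = (qc * (qc.clamp_min(1e-8).log() + math.log(qc.size(-1)))).sum(-1)
        return rec + (qc * kl_z).sum(-1).mean() + kl_c.mean()
```

In training, the returned value would be added to the usual LM objective with a small weight (e.g. `loss = lm_loss + reg_weight * gmvae(hidden_states)`, where `reg_weight` is a hypothetical hyperparameter); at inference the layer can simply be bypassed.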
Related papers
- MoEUT: Mixture-of-Experts Universal Transformers [75.96744719516813]
Universal Transformers (UTs) have advantages over standard Transformers in learning compositional generalizations.
Layer-sharing drastically reduces the parameter count compared to the non-shared model with the same dimensionality.
No previous work has succeeded in proposing a shared-layer Transformer design that is competitive in parameter count-dominated tasks such as language modeling.
arXiv Detail & Related papers (2024-05-25T03:24:32Z)
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- GIVT: Generative Infinite-Vocabulary Transformers [18.55070896912795]
We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate vector sequences with real-valued entries.
Inspired by the image-generation paradigm of VQ-GAN and MaskGIT, we use GIVT to model the unquantized real-valued latent sequences of a β-VAE.
In class-conditional image generation GIVT outperforms VQ-GAN as well as MaskGIT, and achieves performance competitive with recent latent diffusion models.
arXiv Detail & Related papers (2023-12-04T18:48:02Z)
- Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts.
We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, Alibi, and Kerple.
We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
- Closing the gap: Exact maximum likelihood training of generative autoencoders using invertible layers [7.76925617801895]
We show that VAE-style autoencoders can be constructed using invertible layers, which offer a tractable exact likelihood without the need for regularization terms.
This is achieved while leaving complete freedom in the choice of encoder, decoder and prior architectures.
We show that the approach results in strikingly higher performance than architecturally equivalent VAEs in terms of log-likelihood, sample quality, and denoising performance.
arXiv Detail & Related papers (2022-05-19T13:16:09Z)
- Hierarchical RNNs-Based Transformers MADDPG for Mixed Cooperative-Competitive Environments [1.9241821314180374]
This paper proposes a hierarchical Transformers MADDPG based on RNNs, which we call Hierarchical RNNs-Based Transformers MADDPG (HRTMADDPG).
It consists of a lower-level RNN encoder that encodes multiple step sizes within each time sequence, and an upper, sequence-level Transformer encoder that learns the correlations between multiple sequences.
arXiv Detail & Related papers (2021-05-11T09:22:52Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
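The segment-aware mechanism in the Segatron entry above is only named, not described. In the paper it augments Transformer-XL's position information with sentence- and paragraph-level indices in addition to token positions; the sketch below is a simplified, hedged illustration of that idea using absolute learned embeddings rather than Transformer-XL's relative scheme, and all names and dimension limits are assumptions.

```python
# Simplified illustration of segment-aware positions: token, sentence and
# paragraph indices each get their own embedding table and are summed.
# (Segatron itself extends Transformer-XL's relative position encoding;
# this absolute-embedding variant is only a sketch of the idea.)
import torch
import torch.nn as nn


class SegmentAwarePositions(nn.Module):
    def __init__(self, d_model: int, max_tok: int = 512,
                 max_sent: int = 128, max_para: int = 64):
        super().__init__()
        self.tok = nn.Embedding(max_tok, d_model)
        self.sent = nn.Embedding(max_sent, d_model)
        self.para = nn.Embedding(max_para, d_model)

    def forward(self, tok_idx, sent_idx, para_idx):
        # Each index tensor has shape (batch, seq) and holds the position of a
        # token within its sentence, of the sentence within its paragraph, and
        # of the paragraph within the document, respectively.
        return self.tok(tok_idx) + self.sent(sent_idx) + self.para(para_idx)
```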