Pathologies in priors and inference for Bayesian transformers
- URL: http://arxiv.org/abs/2110.04020v1
- Date: Fri, 8 Oct 2021 10:35:27 GMT
- Title: Pathologies in priors and inference for Bayesian transformers
- Authors: Tristan Cinquin, Alexander Immer, Max Horn, Vincent Fortuin
- Abstract summary: No successful attempts to improve transformer models in terms of predictive uncertainty using Bayesian inference exist.
We find that weight-space inference in transformers does not work well, regardless of the approximate posterior.
We propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights.
- Score: 71.97183475225215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the transformer has established itself as a workhorse in
many applications ranging from natural language processing to reinforcement
learning. Similarly, Bayesian deep learning has become the gold-standard for
uncertainty estimation in safety-critical applications, where robustness and
calibration are crucial. Surprisingly, no successful attempts to improve
transformer models in terms of predictive uncertainty using Bayesian inference
exist. In this work, we study this curiously underpopulated area of Bayesian
transformers. We find that weight-space inference in transformers does not work
well, regardless of the approximate posterior. We also find that the prior is
at least partially at fault, but that it is very hard to find well-specified
weight priors for these models. We hypothesize that these problems stem from
the complexity of obtaining a meaningful mapping from weight-space to
function-space distributions in the transformer. Therefore, moving closer to
function-space, we propose a novel method based on the implicit
reparameterization of the Dirichlet distribution to apply variational inference
directly to the attention weights. We find that this proposed method performs
competitively with our baselines.
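To make the proposed direction more concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of variational inference applied directly to attention weights: each row of attention weights is drawn from a Dirichlet whose concentration is predicted from the usual query-key logits, and the ELBO's KL term is taken against a symmetric Dirichlet prior. The class name `DirichletSelfAttention`, the softplus link, and the prior concentration value are illustrative assumptions; PyTorch's `Dirichlet.rsample()` is used because it provides implicitly reparameterized gradients.

```python
# Hypothetical sketch of Dirichlet-parameterized attention trained with variational
# inference. Not the paper's code; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Dirichlet, kl_divergence


class DirichletSelfAttention(nn.Module):
    """Single-head self-attention whose attention rows are sampled from a Dirichlet
    posterior parameterized by the query-key logits."""

    def __init__(self, d_model: int, prior_concentration: float = 1.0):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.prior_concentration = prior_concentration  # symmetric Dirichlet prior (assumed)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (batch, seq, seq)
        concentration = nn.functional.softplus(logits) + 1e-4    # positive Dirichlet parameters
        posterior = Dirichlet(concentration)
        attn = posterior.rsample()                               # implicitly reparameterized sample
        prior = Dirichlet(torch.full_like(concentration, self.prior_concentration))
        kl = kl_divergence(posterior, prior).sum()               # KL term for the ELBO
        return attn @ v, kl


# Usage: add the (suitably scaled) KL term to the task loss.
layer = DirichletSelfAttention(d_model=32)
out, kl = layer(torch.randn(2, 5, 32))
loss = out.pow(2).mean() + kl / 1000.0  # placeholder task loss + scaled KL
loss.backward()
```

The key design choice this sketch illustrates is that sampling happens in the space of attention weights (close to function space) rather than in weight space, while `rsample()` keeps the objective differentiable through the Dirichlet draw.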
Related papers
- Bayes without Underfitting: Fully Correlated Deep Learning Posteriors via Alternating Projections [11.893371164199312]
Bayesian deep learning all too often underfits so that the Bayesian prediction is less accurate than a simple point estimate.
We propose to build Bayesian approximations in a null space, thereby guaranteeing that the Bayesian predictive does not underfit.
An empirical evaluation shows that the approach scales to large models, including vision transformers with 28 million parameters.
arXiv Detail & Related papers (2024-10-22T11:15:07Z)
- Setting the Record Straight on Transformer Oversmoothing [35.125957267464756]
As model depth increases, Transformers oversmooth, i.e., inputs become more and more similar.
We show that smoothing behavior depends on the eigenspectrum of the value and projection weights.
Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior.
arXiv Detail & Related papers (2024-01-09T01:19:03Z)
- All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations [69.3461199976959]
We propose a model based on invertible neural networks, BERT-INN, to learn the Bijection Hypothesis.
We show the advantage of BERT-INN both theoretically and through extensive experiments.
arXiv Detail & Related papers (2023-05-23T22:30:43Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
- On Isotropy Calibration of Transformers [10.294618771570985]
Studies of the embedding space of transformer models suggest that the distribution of contextual representations is highly anisotropic.
A recent study shows that the embedding space of transformers is locally isotropic, which suggests that these models are already capable of exploiting the expressive capacity of their embedding space.
We conduct an empirical evaluation of state-of-the-art methods for isotropy calibration on transformers and find that they do not provide consistent improvements across models and tasks.
arXiv Detail & Related papers (2021-09-27T18:54:10Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- The FMRIB Variational Bayesian Inference Tutorial II: Stochastic Variational Bayes [1.827510863075184]
This tutorial revisits the original FMRIB Variational Bayes tutorial.
This new approach bears a lot of similarity to, and has benefited from, computational methods applied to machine learning algorithms.
arXiv Detail & Related papers (2020-07-03T11:31:52Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.