Related papers: BayesFormer: Transformer with Uncertainty Estimation

BayesFormer: Transformer with Uncertainty Estimation

URL: http://arxiv.org/abs/2206.00826v1
Date: Thu, 2 Jun 2022 01:54:58 GMT
Title: BayesFormer: Transformer with Uncertainty Estimation
Authors: Karthik Abinav Sankararaman and Sinong Wang and Han Fang
Abstract summary: We introduce BayesFormer, a Transformer model with dropouts designed by Bayesian theory. We show improvements across the board: language modeling and classification, long-sequence understanding, machine translation and acquisition function for active learning.
Score: 31.206243748162553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer has become ubiquitous due to its dominant performance in various NLP and image processing tasks. However, it lacks understanding of how to generate mathematically grounded uncertainty estimates for transformer architectures. Models equipped with such uncertainty estimates can typically improve predictive performance, make networks robust, avoid over-fitting and used as acquisition function in active learning. In this paper, we introduce BayesFormer, a Transformer model with dropouts designed by Bayesian theory. We proposed a new theoretical framework to extend the approximate variational inference-based dropout to Transformer-based architectures. Through extensive experiments, we validate the proposed architecture in four paradigms and show improvements across the board: language modeling and classification, long-sequence understanding, machine translation and acquisition function for active learning.

Related papers

Universal Approximation Theorem for a Single-Layer Transformer [0.0]
Deep learning employs multi-layer neural networks trained via the backpropagation algorithm.<n>Transformers have achieved state-of-the-art performance in natural language processing.<n>We prove that a single-layer Transformer, comprising one self-attention layer followed by a position-wise feed-forward network with ReLU activation, can any continuous sequence-to-sequence mapping on a compact domain to arbitrary precision.
arXiv Detail & Related papers (2025-07-11T11:37:39Z)
Neural ODE Transformers: Analyzing Internal Dynamics and Adaptive Fine-tuning [30.781578037476347]
We introduce a novel approach to modeling transformer architectures using highly flexible non-autonomous neural ordinary differential equations (ODEs) Our proposed model parameterizes all weights of attention and feed-forward blocks through neural networks, expressing these weights as functions of a continuous layer index. Our neural ODE transformer demonstrates performance comparable to or better than vanilla transformers across various configurations and datasets.
arXiv Detail & Related papers (2025-03-03T09:12:14Z)
Entropy-Lens: The Information Signature of Transformer Computations [14.613982627206884]
We introduce Entropy-Lens, a model-agnostic framework to interpret frozen, off-the-shelf large-scale transformers. Our results suggest that entropy-based metrics can serve as a principled tool for unveiling the inner workings of modern transformer architectures.
arXiv Detail & Related papers (2025-02-23T13:33:27Z)
OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization [1.7180235064112577]
We consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model.
arXiv Detail & Related papers (2025-01-30T22:52:40Z)
Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights. This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task. We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z)
Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework [2.8470354623829577]
We develop a framework based on Concept Bottleneck Models to enforce interpretability of time series Transformers. We modify the training objective to encourage a model to develop representations similar to predefined interpretable concepts. We find that the model performance remains mostly unaffected, while the model shows much improved interpretability.
arXiv Detail & Related papers (2024-10-08T14:22:40Z)
Strengthening Structural Inductive Biases by Pre-training to Perform Syntactic Transformations [75.14793516745374]
We propose to strengthen the structural inductive bias of a Transformer by intermediate pre-training. Our experiments confirm that this helps with few-shot learning of syntactic tasks such as chunking. Our analysis shows that the intermediate pre-training leads to attention heads that keep track of which syntactic transformation needs to be applied to which token.
arXiv Detail & Related papers (2024-07-05T14:29:44Z)
Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms as low-rank computation have impressive performance for learning Transformer-based adaption. We analyze how magnitude-based models affect generalization while improving adaption. We conclude that proper magnitude-based has a slight on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
Affine transformation estimation improves visual self-supervised learning [4.40560654491339]
We show that adding a module to constrain the representations to be predictive of an affine transformation improves the performance and efficiency of the learning process. We perform experiments in various modern self-supervised models and see a performance improvement in all cases.
arXiv Detail & Related papers (2024-02-14T10:32:58Z)
A Meta-Learning Perspective on Transformers for Causal Language Modeling [17.293733942245154]
The Transformer architecture has become prominent in developing large causal language models. We establish a meta-learning view of the Transformer architecture when trained for the causal language modeling task. Within the inner optimization, we discover and theoretically analyze a special characteristic of the norms of learned token representations within Transformer-based causal language models.
arXiv Detail & Related papers (2023-10-09T17:27:36Z)
Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications. The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate. There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
A Cognitive Study on Semantic Similarity Analysis of Large Corpora: A Transformer-based Approach [0.0]
We perform semantic similarity analysis and modeling on the U.S. Patent Phrase to Phrase Matching dataset using both traditional and transformer-based techniques. The experimental results demonstrate our methodology's enhanced performance compared to traditional techniques, with an average Pearson correlation score of 0.79.
arXiv Detail & Related papers (2022-07-24T11:06:56Z)
Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale [31.293175512404172]
We introduce Transformer Grammars -- a class of Transformer language models that combine expressive power, scalability, and strong performance of Transformers. We find that Transformer Grammars outperform various strong baselines on multiple syntax-sensitive language modeling evaluation metrics.
arXiv Detail & Related papers (2022-03-01T17:22:31Z)
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions. We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex. This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.