Bayesian Attention Modules
- URL: http://arxiv.org/abs/2010.10604v1
- Date: Tue, 20 Oct 2020 20:30:55 GMT
- Title: Bayesian Attention Modules
- Authors: Xinjie Fan and Shujian Zhang and Bo Chen and Mingyuan Zhou
- Abstract summary: We propose a scalable stochastic version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
- Score: 65.52970388117923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention modules, as simple and effective tools, have not only enabled deep
neural networks to achieve state-of-the-art results in many domains, but also
enhanced their interpretability. Most current models use deterministic
attention modules due to their simplicity and ease of optimization. Stochastic
counterparts, on the other hand, are less popular despite their potential
benefits. The main reason is that stochastic attention often introduces
optimization issues or requires significant model changes. In this paper, we
propose a scalable stochastic version of attention that is easy to implement
and optimize. We construct simplex-constrained attention distributions by
normalizing reparameterizable distributions, making the training process
differentiable. We learn their parameters in a Bayesian framework where a
data-dependent prior is introduced for regularization. We apply the proposed
stochastic attention modules to various attention-based models, with
applications to graph node classification, visual question answering, image
captioning, machine translation, and language understanding. Our experiments
show the proposed method brings consistent improvements over the corresponding
baselines.
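As a rough illustration of the mechanism described in the abstract, the sketch below draws positive, reparameterizable (here Lognormal) samples around the usual scaled dot-product scores, normalizes them over the keys to obtain simplex-constrained attention weights, and returns a KL term toward a data-dependent prior for Bayesian regularization. The class name, the Lognormal parameterization, and the prior centered on the detached scores are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a stochastic attention module in PyTorch (illustrative
# assumptions throughout: class name, Lognormal parameterization, and the
# choice of prior are not taken from the authors' released code).
import torch
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence


class StochasticAttentionSketch(torch.nn.Module):
    """Attention weights are normalized samples from a reparameterizable distribution."""

    def __init__(self, dim: int, prior_scale: float = 1.0):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.raw_sigma = torch.nn.Parameter(torch.zeros(1))  # posterior noise scale
        self.prior_scale = prior_scale

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (B, L, L)

        # Reparameterized draw: exp of a Normal sample is a Lognormal sample,
        # so the unnormalized weights are positive and gradients flow through
        # rsample(), keeping the training process differentiable.
        sigma = F.softplus(self.raw_sigma)
        unnormalized = torch.exp(Normal(scores, sigma).rsample())

        # Normalizing positive samples over the keys yields simplex-constrained
        # attention weights, playing the role of softmax in deterministic attention.
        attn = unnormalized / unnormalized.sum(dim=-1, keepdim=True)

        # Data-dependent prior: a Lognormal centered on the (detached) scores.
        # KL between the two Lognormals equals KL between the underlying Normals,
        # since KL divergence is invariant under the invertible exp transform.
        posterior = Normal(scores, sigma)
        prior = Normal(scores.detach(), torch.full_like(scores, self.prior_scale))
        kl = kl_divergence(posterior, prior).mean()

        return attn @ v, kl
```

During training, the returned KL term would be weighted and added to the task loss; the paper's actual choice of distributions and data-dependent prior may differ from this simplified sketch.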
Related papers
- Adjusting Pretrained Backbones for Performativity [34.390793811659556]
We propose a novel technique to adjust pretrained backbones for performativity in a modular way.
We show how it leads to smaller loss along the retraining trajectory and enables us to effectively select among candidate models to anticipate performance degradations.
arXiv Detail & Related papers (2024-10-06T14:41:13Z)
- iSeg: An Iterative Refinement-based Framework for Training-free Segmentation [85.58324416386375]
We present an in-depth experimental analysis of iteratively refining cross-attention maps with self-attention maps.
We propose an effective iterative refinement framework for training-free segmentation, named iSeg.
Our proposed iSeg achieves an absolute gain of 3.8% in mIoU over the best existing training-free approach in the literature.
arXiv Detail & Related papers (2024-09-05T03:07:26Z)
- Patch-Prompt Aligned Bayesian Prompt Tuning for Vision-Language Models [48.77653835765705]
We introduce a probabilistic resolution to prompt tuning, where the label-specific prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model.
We evaluate the effectiveness of our approach on four tasks: few-shot image recognition, base-to-new generalization, dataset transfer learning, and domain shifts.
arXiv Detail & Related papers (2023-03-16T06:09:15Z)
- An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z)
- Bayesian Graph Contrastive Learning [55.36652660268726]
We propose a novel perspective on graph contrastive learning methods, showing that random augmentations lead to stochastic encoders.
Our proposed method represents each node by a distribution in the latent space in contrast to existing techniques which embed each node to a deterministic vector.
We show a considerable improvement in performance compared to existing state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2021-12-15T01:45:32Z)
- Probabilistic Attention for Interactive Segmentation [0.0]
We show that the standard dot-product attention in transformers is a special case of Maximum A Posteriori (MAP) inference.
The proposed approach suggests the use of Expectation Maximization algorithms for online adaptation of key and value model parameters.
arXiv Detail & Related papers (2021-06-23T00:19:43Z)
- Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art stochastic attention in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
arXiv Detail & Related papers (2021-06-09T17:46:22Z)
- More Is More -- Narrowing the Generalization Gap by Adding Classification Heads [8.883733362171032]
We introduce an architecture enhancement for existing neural network models based on input transformations, termed 'TransNet'.
Our model can be employed during training time only and then pruned for prediction, resulting in an equivalent architecture to the base model.
arXiv Detail & Related papers (2021-02-09T16:30:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.