How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention
- URL: http://arxiv.org/abs/2011.00943v2
- Date: Tue, 3 Nov 2020 04:25:12 GMT
- Title: How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention
- Authors: Yue Guan, Jingwen Leng, Chao Li, Quan Chen, Minyi Guo
- Abstract summary: We cluster attention heatmaps into significantly different patterns through unsupervised clustering.
Our proposed features can be used to explain and calibrate different attention heads in Transformer models.
- Score: 20.191319097826266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research on the multi-head attention mechanism, especially in
pre-trained models such as BERT, has provided heuristics and clues for analyzing
various aspects of the mechanism. Because most of this research focuses on probing
tasks or hidden states, previous works have identified some primitive patterns of
attention head behavior through heuristic analytical methods, but a more systematic
analysis of the attention patterns themselves is still lacking. In this work, we
cluster the attention heatmaps into clearly distinct patterns through unsupervised
clustering on top of a set of proposed features, which corroborates previous
observations. We further study the functions corresponding to each pattern through
analytical study. In addition, our proposed features can be used to explain and
calibrate different attention heads in Transformer models.
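The pipeline can be pictured concretely. Below is a minimal sketch, assuming the HuggingFace transformers and scikit-learn libraries: it extracts BERT's per-head attention maps for one sentence, computes a single illustrative distance-based feature (the mean attended token distance per head), and clusters the heads with k-means. The sentence, feature, and cluster count are placeholders, not the paper's exact definitions.

```python
# Minimal sketch (not the paper's exact pipeline), assuming the HuggingFace
# "transformers" and scikit-learn libraries: extract per-head attention maps,
# compute one illustrative distance-based feature per head, cluster with k-means.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions  # tuple of (1, heads, seq, seq), one per layer

features = []
for layer_attn in attentions:
    attn = layer_attn[0]                                    # (heads, seq, seq)
    seq_len = attn.shape[-1]
    pos = torch.arange(seq_len, dtype=torch.float)
    dist = (pos[None, :] - pos[:, None]).abs()              # |i - j| token distances
    mean_dist = (attn * dist).sum(dim=(-1, -2)) / seq_len   # expected attended distance per head
    features.extend(mean_dist.tolist())

X = np.array(features).reshape(-1, 1)                       # one feature per (layer, head)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
print(labels.reshape(len(attentions), -1))                  # cluster id for every head
```

In practice such features would be aggregated over many inputs, and the paper uses a richer feature set than this single scalar per head.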
Related papers
- On the Anatomy of Attention [0.0]
We introduce a category-theoretic diagrammatic formalism in order to systematically relate and reason about machine learning models.
Our diagrams present architectures intuitively but without loss of essential detail, where natural relationships between models are captured by graphical transformations.
arXiv Detail & Related papers (2024-07-02T16:50:26Z)
- Attention Diversification for Domain Generalization [92.02038576148774]
Convolutional neural networks (CNNs) have demonstrated strong results at learning discriminative features.
When applied to unseen domains, state-of-the-art models are usually prone to errors due to domain shift.
We propose a novel Attention Diversification framework, in which Intra-Model and Inter-Model Attention Diversification Regularization are applied jointly.
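As a rough illustration of the intra-model side of this idea (and not the paper's exact regularizers, which operate on CNN attention), one can penalize overlap between per-head attention maps:

```python
# Hypothetical intra-model attention-diversification penalty (illustration only,
# not the paper's exact regularizers): discourage different heads from attending
# to the same positions by penalizing pairwise cosine similarity of their maps.
import torch
import torch.nn.functional as F

def diversification_penalty(attn: torch.Tensor) -> torch.Tensor:
    """attn: (heads, seq, seq) attention maps from one layer."""
    flat = F.normalize(attn.flatten(1), dim=1)   # (heads, seq*seq), unit norm
    sim = flat @ flat.t()                        # pairwise cosine similarity
    off_diag = sim - torch.eye(sim.size(0))      # drop self-similarity
    return off_diag.abs().mean()                 # lower value = more diverse heads

# Usage: total_loss = task_loss + lambda_div * diversification_penalty(attn)
```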
arXiv Detail & Related papers (2022-10-09T09:15:21Z)
- A General Survey on Attention Mechanisms in Deep Learning [7.5537115673774275]
This survey provides an overview of the most important attention mechanisms proposed in the literature.
The various attention mechanisms are explained by means of a framework consisting of a general attention model, uniform notation, and a comprehensive taxonomy of attention mechanisms.
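For reference, the general attention model that such a taxonomy is usually organized around reduces, in its most common form, to scaled dot-product attention; a minimal sketch:

```python
# Minimal scaled dot-product attention in PyTorch: a softmax over query-key
# scores is used to weight the values. Most surveyed variants modify how the
# scores, the softmax, or the value aggregation are computed.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq, d); mask: (batch, seq, seq), 0 where attention is blocked."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # one attention distribution per query
    return weights @ v, weights
```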
arXiv Detail & Related papers (2022-03-27T10:06:23Z)
- Alignment Attention by Matching Key and Query Distributions [48.93793773929006]
This paper introduces alignment attention that explicitly encourages self-attention to match the distributions of the key and query within each head.
It is simple to convert any model with self-attention, including pre-trained ones, to the proposed alignment attention.
On a variety of language understanding tasks, we show the effectiveness of our method in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
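A hypothetical sketch of the idea, using a simple moment-matching distance between per-head key and query statistics (the paper's actual matching objective may differ):

```python
# Hypothetical sketch of the alignment idea: pull the per-head key and query
# distributions toward each other with a simple moment-matching distance.
# The paper's actual matching objective may differ; this is illustration only.
import torch

def key_query_alignment_loss(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (batch, heads, seq, d_head) projections from one attention layer."""
    q_mean, k_mean = q.mean(dim=2), k.mean(dim=2)   # per-head means over tokens
    q_var, k_var = q.var(dim=2), k.var(dim=2)       # per-head variances over tokens
    return ((q_mean - k_mean) ** 2).mean() + ((q_var - k_var) ** 2).mean()

# Usage: total_loss = task_loss + lambda_align * key_query_alignment_loss(q, k)
```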
arXiv Detail & Related papers (2021-10-25T00:54:57Z)
- Building Interpretable Models for Business Process Prediction using Shared and Specialised Attention Mechanisms [5.607831842909669]
We address the "black-box" problem in predictive process analytics by building interpretable models.
We propose two types of attention: event attention, which captures the impact of specific process events on a prediction, and attribute attention, which reveals which attribute(s) of an event influenced the prediction.
arXiv Detail & Related papers (2021-09-03T10:17:05Z)
- Multilingual Multi-Aspect Explainability Analyses on Machine Reading Comprehension Models [76.48370548802464]
This paper conducts a series of analytical experiments to examine the relation between multi-head self-attention and final MRC system performance.
We discover that passage-to-question and passage understanding attentions are the most important ones in the question answering process.
Through comprehensive visualizations and case studies, we also observe several general findings on the attention maps, which can be helpful to understand how these models solve the questions.
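One such analysis can be sketched as measuring the attention mass flowing from passage tokens to question tokens in each head; the indices and aggregation below are hypothetical, not the paper's exact protocol:

```python
# Illustrative sketch (indices and aggregation are hypothetical, not the
# paper's exact protocol): attention mass flowing from passage tokens to
# question tokens, per head.
import torch

def passage_to_question_mass(attn, passage_idx, question_idx):
    """attn: (heads, seq, seq); passage_idx / question_idx: 1-D LongTensors of positions."""
    sub = attn[:, passage_idx][:, :, question_idx]   # (heads, |passage|, |question|)
    return sub.sum(dim=-1).mean(dim=-1)              # average mass per head

# Heads with consistently high passage-to-question mass across examples would be
# candidates for the "most important" heads in question answering.
```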
arXiv Detail & Related papers (2021-08-26T04:23:57Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
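A hypothetical sketch of a differentiable attention mask, relaxing a hard token-pair mask with a sigmoid gate so it can be learned end to end (not the paper's exact DAM algorithm):

```python
# Hypothetical differentiable attention mask (illustration only, not the paper's
# exact DAM algorithm): a learnable soft gate over token pairs, relaxed with a
# sigmoid so it trains end to end and can later be thresholded into a sparse pattern.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableAttentionMask(nn.Module):
    def __init__(self, max_len: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(max_len, max_len))  # one gate per token pair

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        """scores: (batch, heads, seq, seq) pre-softmax attention scores."""
        seq = scores.size(-1)
        gate = torch.sigmoid(self.logits[:seq, :seq])   # soft mask in (0, 1)
        masked = scores + torch.log(gate + 1e-9)        # gate -> 0 suppresses the pair
        return F.softmax(masked, dim=-1)
```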
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference [68.12511526813991]
We provide a novel understanding of multi-head attention from a Bayesian perspective.
We propose a non-parametric approach that explicitly improves the repulsiveness in multi-head attention.
Experiments on various attention models and applications demonstrate that the proposed repulsive attention can improve the learned feature diversity.
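A rough, hypothetical illustration of repulsiveness between heads, treating each head's pooled features as a particle and penalizing an RBF kernel between particles (the paper's actual method is a Bayesian particle-optimization view, not this simple penalty):

```python
# Rough, hypothetical illustration of repulsiveness between heads (not the
# paper's Bayesian particle-optimization procedure): treat each head's pooled
# features as a particle and penalize an RBF kernel between particles.
import torch

def head_repulsion(head_feats: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """head_feats: (heads, dim), one pooled feature vector per head."""
    d2 = torch.cdist(head_feats, head_feats).pow(2)      # pairwise squared distances
    kernel = torch.exp(-d2 / (2 * bandwidth ** 2))       # RBF similarity between heads
    off_diag = kernel - torch.diag(torch.diag(kernel))   # remove self-terms
    return off_diag.mean()                               # minimizing this pushes heads apart

# Usage: total_loss = task_loss + lambda_rep * head_repulsion(head_feats)
```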
arXiv Detail & Related papers (2020-09-20T06:32:23Z)
- Bayesian Sparse Factor Analysis with Kernelized Observations [67.60224656603823]
Multi-view problems can be addressed with latent variable models.
High-dimensionality and non-linearity are traditionally handled by kernel methods.
We propose merging both approaches into a single model.
arXiv Detail & Related papers (2020-06-01T14:25:38Z)