Eigen Analysis of Self-Attention and its Reconstruction from Partial
Computation
- URL: http://arxiv.org/abs/2106.08823v1
- Date: Wed, 16 Jun 2021 14:38:42 GMT
- Title: Eigen Analysis of Self-Attention and its Reconstruction from Partial
Computation
- Authors: Srinadh Bhojanapalli, Ayan Chakrabarti, Himanshu Jain, Sanjiv Kumar,
Michal Lukasik, Andreas Veit
- Abstract summary: We study the global structure of attention scores computed using dot-product based self-attention.
We find that most of the variation among attention scores lies in a low-dimensional eigenspace.
We propose to compute scores only for a partial subset of token pairs, and use them to estimate scores for the remaining pairs.
- Score: 58.80806716024701
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art transformer models use pairwise dot-product based
self-attention, which comes at a computational cost quadratic in the input
sequence length. In this paper, we investigate the global structure of
attention scores computed using this dot product mechanism on a typical
distribution of inputs, and study the principal components of their variation.
Through eigen analysis of full attention score matrices, as well as of their
individual rows, we find that most of the variation among attention scores lies
in a low-dimensional eigenspace. Moreover, we find significant overlap between
these eigenspaces for different layers and even different transformer models.
Based on this, we propose to compute scores only for a partial subset of token
pairs, and use them to estimate scores for the remaining pairs. Beyond
investigating the accuracy of reconstructing attention scores themselves, we
investigate training transformer models that employ these approximations, and
analyze the effect on overall accuracy. Our analysis and the proposed method
provide insights into how to balance the benefits of exact pair-wise attention
and its significant computational expense.
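To make the proposed scheme concrete, the following is a minimal NumPy sketch of one way such a reconstruction could work: an eigenbasis for attention-score rows is fit offline, dot-product scores are computed only for a subset of key positions, and the remaining entries are filled in by least squares in that low-dimensional eigenspace. The basis-fitting step, the random subset choice, and all function names (attention_scores, estimate_row_basis, reconstruct_rows) are illustrative assumptions, not the authors' exact algorithm.

```python
# Minimal sketch: low-dimensional reconstruction of attention scores
# from a partial subset of token pairs. Illustrative only.
import numpy as np

def attention_scores(Q, K):
    """Full pre-softmax dot-product attention scores, shape (n, n)."""
    return (Q @ K.T) / np.sqrt(Q.shape[-1])

def estimate_row_basis(score_rows, k):
    """PCA/eigen analysis of a sample of attention-score rows.

    Returns the row mean and the top-k principal directions, spanning the
    low-dimensional eigenspace that (per the paper) captures most of the
    variation among attention scores.
    """
    mean = score_rows.mean(axis=0)
    centered = score_rows - mean
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return mean, Vt[:k].T                       # shapes: (n,), (n, k)

def reconstruct_rows(Q, K, mean, U, subset):
    """Compute scores only at `subset` key positions, estimate the rest.

    Coefficients in the k-dimensional eigenspace are fit by least squares to
    the partially computed scores, then used to fill in the remaining pairs.
    """
    partial = (Q @ K[subset].T) / np.sqrt(Q.shape[-1])          # (n, |subset|)
    coeffs, *_ = np.linalg.lstsq(U[subset], (partial - mean[subset]).T, rcond=None)
    return mean + (U @ coeffs).T                                # (n, n) estimate

# Toy usage: fit a basis, then reconstruct from ~25% of the key positions.
rng = np.random.default_rng(0)
n, d, k = 64, 32, 8
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
full = attention_scores(Q, K)
mean, U = estimate_row_basis(full, k)        # in practice: rows sampled from many inputs
subset = rng.choice(n, size=n // 4, replace=False)
approx = reconstruct_rows(Q, K, mean, U, subset)
print("relative error:", np.linalg.norm(approx - full) / np.linalg.norm(full))
```

In this toy example the basis is fit on the very matrix being reconstructed, so the error is optimistic; the paper instead studies eigenspaces that are shared across inputs and overlap substantially across layers and even across transformer models.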
Related papers
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Spectral Self-supervised Feature Selection [7.052728135831165]
We propose a self-supervised graph-based approach for unsupervised feature selection.
Our method's core involves computing robust pseudo-labels by applying simple processing steps to the graph Laplacian's eigenvectors.
Our approach is shown to be robust to challenging scenarios, such as the presence of outliers and complex substructures.
arXiv Detail & Related papers (2024-07-12T07:29:08Z)
- Are queries and keys always relevant? A case study on Transformer wave functions [0.0]
The dot-product attention mechanism, originally designed for natural language processing (NLP) tasks, is a cornerstone of modern Transformers.
We explore the suitability of Transformers, focusing on their attention mechanisms, in the specific domain of the parametrization of variational wave functions.
arXiv Detail & Related papers (2024-05-29T08:32:37Z)
- Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations [15.59251297818324]
We present an approach for analyzing grouping information contained within a neural network's activations.
We exploit features from all layers, obviating the need to guess which part of the model contains relevant information.
arXiv Detail & Related papers (2023-12-11T01:20:34Z)
- Amortized Inference for Causal Structure Learning [72.84105256353801]
Learning causal structure poses a search problem that typically involves evaluating structures using a score or independence test.
We train a variational inference model to predict the causal structure from observational/interventional data.
Our models exhibit robust generalization capabilities under substantial distribution shift.
arXiv Detail & Related papers (2022-05-25T17:37:08Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey [5.967999555890417]
This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and the Variational Autoencoder (VAE).
These models assume that every data point is generated from, or caused by, a low-dimensional latent factor.
Owing to their generative behaviour, these models can also be used to generate new data points in the data space (see the minimal PPCA sampling sketch after this list).
arXiv Detail & Related papers (2021-01-04T01:29:09Z)
- Linear Classifier Combination via Multiple Potential Functions [0.6091702876917279]
We propose a novel concept of calculating a scoring function based on the distance of the object from the decision boundary and its distance to the class centroid.
An important property is that the proposed score function has the same nature for all linear base classifiers.
arXiv Detail & Related papers (2020-10-02T08:11:51Z)
- Nonparametric Score Estimators [49.42469547970041]
Estimating the score from a set of samples generated by an unknown distribution is a fundamental task in inference and learning of probabilistic models.
We provide a unifying view of these estimators under the framework of regularized nonparametric regression.
We propose score estimators based on iterative regularization that enjoy computational benefits from curl-free kernels and fast convergence.
arXiv Detail & Related papers (2020-05-20T15:01:03Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
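As a companion to the Factor Analysis / probabilistic PCA tutorial entry above, here is a minimal sketch of the generative model it surveys: data are explained by a low-dimensional latent factor, and new points can be sampled from the fitted model. The closed-form fit follows the standard Tipping and Bishop maximum-likelihood solution for probabilistic PCA; the toy data and function names are assumptions for illustration.

```python
# Minimal probabilistic PCA sketch: fit (mu, W, sigma^2), then generate
# new data points from the low-dimensional latent factor. Illustrative only.
import numpy as np

def fit_ppca(X, k):
    """Closed-form ML estimates of (mu, W, sigma^2) for probabilistic PCA."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)              # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order
    sigma2 = eigvals[k:].mean()                          # average discarded variance
    W = eigvecs[:, :k] * np.sqrt(np.maximum(eigvals[:k] - sigma2, 0.0))
    return mu, W, sigma2

def generate(mu, W, sigma2, n, rng):
    """Draw new data points: x = W z + mu + eps, z ~ N(0, I), eps ~ N(0, sigma^2 I)."""
    d, k = W.shape
    z = rng.normal(size=(n, k))                          # latent factors
    noise = rng.normal(scale=np.sqrt(sigma2), size=(n, d))
    return mu + z @ W.T + noise

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # toy correlated data
mu, W, sigma2 = fit_ppca(X, k=3)
samples = generate(mu, W, sigma2, n=5, rng=rng)
print(samples.shape)    # (5, 10)
```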