Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers
- URL: http://arxiv.org/abs/2205.08078v1
- Date: Tue, 17 May 2022 04:01:15 GMT
- Title: Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers
- Authors: Arda Sahiner, Tolga Ergen, Batu Ozturkler, John Pauly, Morteza
Mardani, Mert Pilanci
- Abstract summary: This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens based on their latent similarity.
- Score: 52.468311268601056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers using self-attention or its proposed alternatives have
demonstrated promising results in many image-related tasks. However, the
underpinning inductive bias of attention is not well understood. To address
this issue, this paper analyzes attention through the lens of convex duality.
For the non-linear dot-product self-attention, and alternative mechanisms such
as MLP-mixer and Fourier Neural Operator (FNO), we derive equivalent
finite-dimensional convex problems that are interpretable and solvable to
global optimality. The convex programs lead to block nuclear-norm
regularization that promotes low rank in the latent feature and token
dimensions. In particular, we show how self-attention networks implicitly
cluster the tokens based on their latent similarity. We conduct experiments
for transferring a pre-trained transformer backbone for CIFAR-100
classification by fine-tuning a variety of convex attention heads. The results
indicate the merits of the bias induced by attention compared with the existing
MLP or linear heads.
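
To make the regularizer concrete, here is a minimal PyTorch sketch (not the authors' code; the shapes and the 1e-3 penalty weight are illustrative) of a block nuclear-norm penalty: the parameter tensor is split into blocks and the sum of their nuclear norms is added to the training loss, which encourages low rank in both the token and latent feature dimensions.

    import torch

    def block_nuclear_norm(W: torch.Tensor) -> torch.Tensor:
        # Sum of nuclear norms (l1 norm of singular values) over the
        # blocks W[b]; W has shape (blocks, tokens, features).
        return sum(torch.linalg.matrix_norm(W[b], ord="nuc")
                   for b in range(W.shape[0]))

    W = torch.randn(4, 16, 8, requires_grad=True)    # 4 blocks, 16 tokens x 8 features
    task_loss = (W ** 2).mean()                      # placeholder for the real task loss
    loss = task_loss + 1e-3 * block_nuclear_norm(W)  # convex low-rank penalty
    loss.backward()                                  # PyTorch differentiates through the SVD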
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Transformers as Support Vector Machines [54.642793677472724]
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image
Anomaly Detection [13.801572236048601]
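
One way to see the SVM connection numerically (an illustration of mine, not the paper's construction): scaling up the attention parameters drives the softmax toward a hard argmax over tokens, the max-margin token selection that the equivalence formalizes.

    import torch

    scores = torch.tensor([2.0, 1.0, -1.0])          # per-token attention scores
    for scale in (1.0, 5.0, 50.0):
        # As the parameter norm (here, the scale) grows, the softmax
        # weights approach the hard selection [1, 0, 0] of the top token.
        print(scale, (scale * scores).softmax(dim=-1))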
- Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection [13.801572236048601]
In this paper, we propose a novel AD framework: FOcus-the-Discrepancy (FOD), which can simultaneously spot the patch-wise, intra- and inter-discrepancies of anomalies.
arXiv Detail & Related papers (2023-08-06T01:30:26Z) - Transformers meet Stochastic Block Models: Attention with Data-Adaptive
Sparsity and Cost [53.746169882193456]
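
A loose sketch of the intra-/inter-correlation idea (the scoring rule below is my guess at the spirit of the method, not the FOD implementation): compare each patch feature against its own image and against patches from normal reference images, and flag patches that correlate poorly with both.

    import torch
    import torch.nn.functional as F

    feats = F.normalize(torch.randn(64, 128), dim=-1)  # patch features, test image
    ref = F.normalize(torch.randn(256, 128), dim=-1)   # patch features, normal images
    intra = feats @ feats.T                            # within-image correlation (64, 64)
    inter = feats @ ref.T                              # correlation to normal data (64, 256)
    # Anomalous patches resemble neither their own image nor the normal data.
    score = (1 - intra.mean(dim=-1)) + (1 - inter.max(dim=-1).values)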
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves these problems by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z) - The Devil in Linear Transformer [42.232886799710215]
- The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z) - Attention is Not All You Need: Pure Attention Loses Rank Doubly
Exponentially with Depth [48.16156149749371]
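
For context, here is a generic linear-attention sketch (the ELU+1 feature map follows common practice and is not necessarily this paper's variant): reordering the matmuls as phi(Q)(phi(K)^T V) cuts the cost from O(n^2 d) to O(n d^2), and the row-wise normalizer below is the kind of term this paper scrutinizes.

    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        # phi(Q) (phi(K)^T V): the (d, d) summary makes the cost linear in n.
        phi = lambda x: F.elu(x) + 1                 # positive feature map
        q, k = phi(q), phi(k)
        kv = k.transpose(-2, -1) @ v                 # (d, d) summary, O(n d^2)
        z = q @ k.sum(dim=-2).unsqueeze(-1) + eps    # row-wise normalizer
        return (q @ kv) / z

    out = linear_attention(torch.randn(64, 32), torch.randn(64, 32), torch.randn(64, 32))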
- Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)