Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers
- URL: http://arxiv.org/abs/2205.08078v1
- Date: Tue, 17 May 2022 04:01:15 GMT
- Title: Unraveling Attention via Convex Duality: Analysis and Interpretations of
Vision Transformers
- Authors: Arda Sahiner, Tolga Ergen, Batu Ozturkler, John Pauly, Morteza
Mardani, Mert Pilanci
- Abstract summary: This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens based on their latent similarity.
- Score: 52.468311268601056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers using self-attention or its proposed alternatives have
demonstrated promising results in many image-related tasks. However, the
underpinning inductive bias of attention is not well understood. To address
this issue, this paper analyzes attention through the lens of convex duality.
For the non-linear dot-product self-attention, and alternative mechanisms such
as MLP-mixer and Fourier Neural Operator (FNO), we derive equivalent
finite-dimensional convex problems that are interpretable and solvable to
global optimality. The convex programs lead to block nuclear-norm
regularization that promotes low rank in the latent feature and token
dimensions. In particular, we show how self-attention networks implicitly
cluster the tokens based on their latent similarity. We conduct experiments
for transferring a pre-trained transformer backbone for CIFAR-100
classification by fine-tuning a variety of convex attention heads. The results
indicate the merits of the bias induced by attention compared with the existing
MLP or linear heads.
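
To make the regularizer concrete, here is a minimal PyTorch sketch (not the authors' code; the shapes and the 1e-3 penalty weight are illustrative) of a block nuclear-norm penalty: the parameter tensor is split into blocks and the sum of their nuclear norms is added to the training loss, which encourages low rank in both the token and latent feature dimensions.

    import torch

    def block_nuclear_norm(W: torch.Tensor) -> torch.Tensor:
        # Sum of nuclear norms (l1 norm of singular values) over the
        # blocks W[b]; W has shape (blocks, tokens, features).
        return sum(torch.linalg.matrix_norm(W[b], ord="nuc")
                   for b in range(W.shape[0]))

    W = torch.randn(4, 16, 8, requires_grad=True)    # 4 blocks, 16 tokens x 8 features
    task_loss = (W ** 2).mean()                      # placeholder for the real task loss
    loss = task_loss + 1e-3 * block_nuclear_norm(W)  # convex low-rank penalty
    loss.backward()                                  # PyTorch differentiates through the SVD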
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z) - Transformers as Support Vector Machines [54.642793677472724]
- Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image
Anomaly Detection [13.801572236048601]
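
One way to see the SVM connection numerically (an illustration of mine, not the paper's construction): scaling up the attention parameters drives the softmax toward a hard argmax over tokens, the max-margin token selection that the equivalence formalizes.

    import torch

    scores = torch.tensor([2.0, 1.0, -1.0])          # per-token attention scores
    for scale in (1.0, 5.0, 50.0):
        # As the parameter norm (here, the scale) grows, the softmax
        # weights approach the hard selection [1, 0, 0] of the top token.
        print(scale, (scale * scores).softmax(dim=-1))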
- Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection [13.801572236048601]
In this paper, we propose a novel AD framework: FOcus-the-Discrepancy (FOD), which can simultaneously spot the patch-wise, intra- and inter-discrepancies of anomalies.
arXiv Detail & Related papers (2023-08-06T01:30:26Z) - Transformers meet Stochastic Block Models: Attention with Data-Adaptive
Sparsity and Cost [53.746169882193456]
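
A loose sketch of the intra-/inter-correlation idea (the scoring rule below is my guess at the spirit of the method, not the FOD implementation): compare each patch feature against its own image and against patches from normal reference images, and flag patches that correlate poorly with both.

    import torch
    import torch.nn.functional as F

    feats = F.normalize(torch.randn(64, 128), dim=-1)  # patch features, test image
    ref = F.normalize(torch.randn(256, 128), dim=-1)   # patch features, normal images
    intra = feats @ feats.T                            # within-image correlation (64, 64)
    inter = feats @ ref.T                              # correlation to normal data (64, 256)
    # Anomalous patches resemble neither their own image nor the normal data.
    score = (1 - intra.mean(dim=-1)) + (1 - inter.max(dim=-1).values)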
- Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost [53.746169882193456]
Recent works have proposed various sparse attention modules to overcome the quadratic cost of self-attention.
We propose a model that resolves these problems by endowing each attention head with a mixed-membership Stochastic Block Model.
Our model outperforms previous efficient variants as well as the original Transformer with full attention.
arXiv Detail & Related papers (2022-10-27T15:30:52Z) - The Devil in Linear Transformer [42.232886799710215]
- The Devil in Linear Transformer [42.232886799710215]
Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers.
They usually suffer from degraded performance on various tasks and corpora.
In this paper, we identify two key issues that lead to such performance gaps.
arXiv Detail & Related papers (2022-10-19T07:15:35Z) - Attention is Not All You Need: Pure Attention Loses Rank Doubly
Exponentially with Depth [48.16156149749371]
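
For context, here is a generic linear-attention sketch (the ELU+1 feature map follows common practice and is not necessarily this paper's variant): reordering the matmuls as phi(Q)(phi(K)^T V) cuts the cost from O(n^2 d) to O(n d^2), and the row-wise normalizer below is the kind of term this paper scrutinizes.

    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v, eps=1e-6):
        # phi(Q) (phi(K)^T V): the (d, d) summary makes the cost linear in n.
        phi = lambda x: F.elu(x) + 1                 # positive feature map
        q, k = phi(q), phi(k)
        kv = k.transpose(-2, -1) @ v                 # (d, d) summary, O(n d^2)
        z = q @ k.sum(dim=-2).unsqueeze(-1) + eps    # row-wise normalizer
        return (q @ kv) / z

    out = linear_attention(torch.randn(64, 32), torch.randn(64, 32), torch.randn(64, 32))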
- Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z) - SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)