Simplicity Bias of Transformers to Learn Low Sensitivity Functions
- URL: http://arxiv.org/abs/2403.06925v1
- Date: Mon, 11 Mar 2024 17:12:09 GMT
- Title: Simplicity Bias of Transformers to Learn Low Sensitivity Functions
- Authors: Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang,
Vatsal Sharan
- Abstract summary: Transformers achieve state-of-the-art accuracy and robustness across many tasks.
An understanding of the inductive biases they have, and how those biases differ from those of other neural network architectures, remains elusive.
- Score: 19.898451497341714
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers achieve state-of-the-art accuracy and robustness across many
tasks, but an understanding of the inductive biases that they have and how
those biases are different from other neural network architectures remains
elusive. Various neural network architectures such as fully connected networks
have been found to have a simplicity bias towards simple functions of the data;
one version of this simplicity bias is a spectral bias to learn simple
functions in the Fourier space. In this work, we identify the notion of
sensitivity of the model to random changes in the input as a notion of
simplicity bias which provides a unified metric to explain the simplicity and
spectral bias of transformers across different data modalities. We show that
transformers have lower sensitivity than alternative architectures, such as
LSTMs, MLPs and CNNs, across both vision and language tasks. We also show that
low-sensitivity bias correlates with improved robustness; furthermore, it can
also be used as an efficient intervention to further improve the robustness of
transformers.
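The abstract's central notion, sensitivity to random changes in the input, can be illustrated on Boolean functions. The sketch below is a minimal Monte Carlo estimator, not the paper's exact definition; the function names (`estimate_sensitivity`, `parity`, `majority`) are illustrative, not from the paper.

```python
import random

def estimate_sensitivity(f, n_bits, n_samples=2000, seed=0):
    """Monte Carlo estimate of average sensitivity: draw a random input x
    and a random coordinate i, and check how often flipping x[i] changes
    f(x). Scaling by n_bits approximates the expected number of
    coordinates whose flip changes the output."""
    rng = random.Random(seed)
    changed = 0
    for _ in range(n_samples):
        x = [rng.randint(0, 1) for _ in range(n_bits)]
        i = rng.randrange(n_bits)
        y = list(x)
        y[i] ^= 1  # flip one input bit
        if f(x) != f(y):
            changed += 1
    return n_bits * changed / n_samples

def parity(x):
    # Maximally sensitive: every single-bit flip changes the output.
    return sum(x) % 2

def majority(x):
    # Lower sensitivity: only near-tied inputs are affected by one flip.
    return int(sum(x) > len(x) // 2)
```

On 8-bit inputs, `estimate_sensitivity(parity, 8)` returns exactly 8.0 (every flip changes parity), while the estimate for `majority` is much lower; a low-sensitivity bias means a model preferentially fits functions closer to the `majority` end of this scale.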
Related papers
- Differential Transformer [99.5117269150629]
Transformer tends to overallocate attention to irrelevant context.
We introduce Diff Transformer, which amplifies attention to relevant context while canceling noise.
It offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.
arXiv Detail & Related papers (2024-10-07T17:57:38Z)
- Why are Sensitive Functions Hard for Transformers? [1.0632690677209804]
We show that under the transformer architecture, the loss landscape is constrained by the input-space sensitivity.
We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers.
arXiv Detail & Related papers (2024-02-15T14:17:51Z)
- Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions [29.461559919821802]
Recent works have found that Transformers struggle to model several formal languages when compared to recurrent models.
This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models.
arXiv Detail & Related papers (2022-11-22T15:10:48Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- The Nuts and Bolts of Adopting Transformer in GANs [124.30856952272913]
We investigate the properties of Transformer in the generative adversarial network (GAN) framework for high-fidelity image synthesis.
Our study leads to a new alternative design of Transformers in GANs: a convolutional neural network (CNN)-free generator termed STrans-G.
arXiv Detail & Related papers (2021-10-25T17:01:29Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- On the Adversarial Robustness of Visual Transformers [129.29523847765952]
This work provides the first and comprehensive study on the robustness of vision transformers (ViTs) against adversarial perturbations.
Tested in various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness than convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-03-29T14:48:24Z)
- Translational Equivariance in Kernelizable Attention [3.236198583140341]
We show how translational equivariance can be implemented in efficient Transformers based on kernelizable attention.
Our experiments highlight that the devised approach significantly improves robustness of Performers to shifts of input images.
arXiv Detail & Related papers (2021-02-15T17:14:15Z)
- Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
arXiv Detail & Related papers (2020-02-16T17:16:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.