Simplicity Bias of Transformers to Learn Low Sensitivity Functions
- URL: http://arxiv.org/abs/2403.06925v1
- Date: Mon, 11 Mar 2024 17:12:09 GMT
- Title: Simplicity Bias of Transformers to Learn Low Sensitivity Functions
- Authors: Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang,
Vatsal Sharan
- Abstract summary: Transformers achieve state-of-the-art accuracy and robustness across many tasks.
An understanding of the inductive biases that they have, and how those biases differ from those of other neural network architectures, remains elusive.
- Score: 19.898451497341714
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers achieve state-of-the-art accuracy and robustness across many
tasks, but an understanding of the inductive biases that they have, and how
those biases differ from those of other neural network architectures, remains
elusive. Various neural network architectures, such as fully connected networks,
have been found to have a simplicity bias towards simple functions of the data;
one version of this simplicity bias is a spectral bias to learn simple
functions in the Fourier space. In this work, we identify the sensitivity of
the model to random changes in the input as a notion of simplicity bias that
provides a unified metric to explain the simplicity and spectral bias of
transformers across different data modalities. We show that
transformers have lower sensitivity than alternative architectures, such as
LSTMs, MLPs and CNNs, across both vision and language tasks. We also show that
low-sensitivity bias correlates with improved robustness; furthermore, it can
be used as an efficient intervention to further improve the robustness of
transformers.
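As a rough illustration of the sensitivity metric described above, here is a minimal sketch, assuming a PyTorch sequence classifier that maps token ids to class logits: it estimates average sensitivity as the rate at which the model's prediction flips when a single randomly chosen token is resampled. The function name, arguments, and the token-resampling perturbation are illustrative assumptions, not the authors' exact protocol (which also covers vision inputs).

```python
import torch

def estimate_sensitivity(model, token_ids, vocab_size, num_trials=10):
    """Illustrative sensitivity estimate: the fraction of examples whose
    predicted class changes when one randomly chosen token is replaced by a
    uniformly random token. `token_ids` has shape (batch, seq_len)."""
    model.eval()
    with torch.no_grad():
        base_pred = model(token_ids).argmax(dim=-1)       # (batch,) predicted classes
        flip_rate = 0.0
        for _ in range(num_trials):
            perturbed = token_ids.clone()
            batch, seq_len = perturbed.shape
            pos = torch.randint(seq_len, (batch,))         # one random position per example
            new_tok = torch.randint(vocab_size, (batch,))  # resample that token uniformly
            perturbed[torch.arange(batch), pos] = new_tok
            flip_rate += (model(perturbed).argmax(dim=-1) != base_pred).float().mean().item()
    return flip_rate / num_trials
```

Under this view, a lower flip rate corresponds to a lower-sensitivity (simpler) function; the paper reports that trained transformers score lower on such a metric than LSTMs, MLPs, and CNNs.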
Related papers
- A distributional simplicity bias in the learning dynamics of transformers [50.91742043564049]
We show that transformers, trained on natural language data, also display a simplicity bias.
Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions.
This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.
arXiv Detail & Related papers (2024-10-25T15:39:34Z) - What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [8.008567379796666]
The Transformer architecture has inarguably revolutionized deep learning.
At its core, the attention block differs in form and functionality from most other architectural components in deep learning.
The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood.
arXiv Detail & Related papers (2024-10-14T18:15:02Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Why are Sensitive Functions Hard for Transformers? [1.0632690677209804]
We show that under the transformer architecture, the loss landscape is constrained by the input-space sensitivity.
We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers.
arXiv Detail & Related papers (2024-02-15T14:17:51Z) - A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree
Spectral Bias of Neural Networks [79.28094304325116]
Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards "simpler" functions.
We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets.
We propose a new scalable functional regularization scheme that aids the neural network to learn higher degree frequencies.
arXiv Detail & Related papers (2023-05-16T20:06:01Z) - Mitigating Bias in Visual Transformers via Targeted Alignment [8.674650784377196]
We study the fairness of transformers applied to computer vision and benchmark several bias mitigation approaches from prior work.
We propose TADeT, a targeted alignment strategy for debiasing transformers that aims to discover and remove bias primarily from query matrix features.
arXiv Detail & Related papers (2023-02-08T22:11:14Z) - Simplicity Bias in Transformers and their Ability to Learn Sparse
Boolean Functions [29.461559919821802]
Recent works have found that Transformers struggle to model several formal languages when compared to recurrent models.
This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models.
arXiv Detail & Related papers (2022-11-22T15:10:48Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics,
and only exploit higher-order statistics later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
arXiv Detail & Related papers (2020-02-16T17:16:31Z)