Simplicity Bias of Transformers to Learn Low Sensitivity Functions
- URL: http://arxiv.org/abs/2403.06925v1
- Date: Mon, 11 Mar 2024 17:12:09 GMT
- Title: Simplicity Bias of Transformers to Learn Low Sensitivity Functions
- Authors: Bhavya Vasudeva, Deqing Fu, Tianyi Zhou, Elliott Kau, Youqi Huang,
Vatsal Sharan
- Abstract summary: Transformers achieve state-of-the-art accuracy and robustness across many tasks.
An understanding of the inductive biases that they have, and how those biases differ from those of other neural network architectures, remains elusive.
- Score: 19.898451497341714
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers achieve state-of-the-art accuracy and robustness across many
tasks, but an understanding of the inductive biases that they have, and how
those biases differ from those of other neural network architectures, remains
elusive. Various neural network architectures, such as fully connected networks,
have been found to have a simplicity bias towards simple functions of the data;
one version of this simplicity bias is a spectral bias to learn simple
functions in the Fourier space. In this work, we identify the sensitivity of
the model to random changes in the input as a notion of simplicity bias that
provides a unified metric to explain the simplicity and spectral bias of
transformers across different data modalities. We show that
transformers have lower sensitivity than alternative architectures, such as
LSTMs, MLPs and CNNs, across both vision and language tasks. We also show that
low-sensitivity bias correlates with improved robustness; furthermore, it can
be used as an efficient intervention to further improve the robustness of
transformers.
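As a rough illustration of the sensitivity metric described above, here is a minimal sketch, assuming a PyTorch sequence classifier that maps token ids to class logits: it estimates average sensitivity as the rate at which the model's prediction flips when a single randomly chosen token is resampled. The function name, arguments, and the token-resampling perturbation are illustrative assumptions, not the authors' exact protocol (which also covers vision inputs).

```python
import torch

def estimate_sensitivity(model, token_ids, vocab_size, num_trials=10):
    """Illustrative sensitivity estimate: the fraction of examples whose
    predicted class changes when one randomly chosen token is replaced by a
    uniformly random token. `token_ids` has shape (batch, seq_len)."""
    model.eval()
    with torch.no_grad():
        base_pred = model(token_ids).argmax(dim=-1)       # (batch,) predicted classes
        flip_rate = 0.0
        for _ in range(num_trials):
            perturbed = token_ids.clone()
            batch, seq_len = perturbed.shape
            pos = torch.randint(seq_len, (batch,))         # one random position per example
            new_tok = torch.randint(vocab_size, (batch,))  # resample that token uniformly
            perturbed[torch.arange(batch), pos] = new_tok
            flip_rate += (model(perturbed).argmax(dim=-1) != base_pred).float().mean().item()
    return flip_rate / num_trials
```

Under this view, a lower flip rate corresponds to a lower-sensitivity (simpler) function; the paper reports that trained transformers score lower on such a metric than LSTMs, MLPs, and CNNs.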
Related papers
- A distributional simplicity bias in the learning dynamics of transformers [50.91742043564049]
We show that transformers, trained on natural language data, also display a simplicity bias.
Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions.
This approach opens up the possibilities of studying how interactions of different orders in the data affect learning, in natural language processing and beyond.
arXiv Detail & Related papers (2024-10-25T15:39:34Z) - What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [8.008567379796666]
The Transformer architecture has inarguably revolutionized deep learning.
At its core, the attention block differs in form and functionality from most other architectural components in deep learning.
The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood.
arXiv Detail & Related papers (2024-10-14T18:15:02Z) - Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance for learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on the testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z) - Why are Sensitive Functions Hard for Transformers? [1.0632690677209804]
We show that under the transformer architecture, the loss landscape is constrained by the input-space sensitivity.
We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers.
arXiv Detail & Related papers (2024-02-15T14:17:51Z) - A Scalable Walsh-Hadamard Regularizer to Overcome the Low-degree
Spectral Bias of Neural Networks [79.28094304325116]
Despite the capacity of neural nets to learn arbitrary functions, models trained through gradient descent often exhibit a bias towards "simpler" functions.
We show how this spectral bias towards low-degree frequencies can in fact hurt the neural network's generalization on real-world datasets.
We propose a new scalable functional regularization scheme that aids the neural network to learn higher degree frequencies.
arXiv Detail & Related papers (2023-05-16T20:06:01Z) - Mitigating Bias in Visual Transformers via Targeted Alignment [8.674650784377196]
We study the fairness of transformers applied to computer vision and benchmark several bias mitigation approaches from prior work.
We propose TADeT, a targeted alignment strategy for debiasing transformers that aims to discover and remove bias primarily from query matrix features.
arXiv Detail & Related papers (2023-02-08T22:11:14Z) - Simplicity Bias in Transformers and their Ability to Learn Sparse
Boolean Functions [29.461559919821802]
Recent works have found that Transformers struggle to model several formal languages when compared to recurrent models.
This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models.
arXiv Detail & Related papers (2022-11-22T15:10:48Z) - Neural networks trained with SGD learn distributions of increasing
complexity [78.30235086565388]
We show that neural networks trained using gradient descent initially classify their inputs using lower-order input statistics,
and only exploit higher-order statistics later in training.
We discuss the relation of this distributional simplicity bias (DSB) to other simplicity biases and consider its implications for the principle of universality in learning.
arXiv Detail & Related papers (2022-11-21T15:27:22Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
arXiv Detail & Related papers (2020-02-16T17:16:31Z)