Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
- URL: http://arxiv.org/abs/2006.03555v3
- Date: Thu, 1 Oct 2020 00:41:49 GMT
- Title: Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
- Authors: Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou
Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, David Belanger,
Lucy Colwell, Adrian Weller
- Abstract summary: We present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR).
Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors.
It provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence.
- Score: 42.93754828584075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have achieved state-of-the-art results across a diverse
range of domains. However, concern over the cost of training the attention
mechanism to learn complex dependencies between distant inputs continues to
grow. In response, solutions that exploit the structure and sparsity of the
learned attention matrix have blossomed. However, real-world applications that
involve long sequences, such as biological sequence analysis, may fall short of
meeting these assumptions, precluding exploration of these models. To address
this challenge, we present a new Transformer architecture, Performer, based on
Fast Attention Via Orthogonal Random features (FAVOR). Our mechanism scales
linearly rather than quadratically in the number of tokens in the sequence, is
characterized by sub-quadratic space complexity and does not incorporate any
sparsity pattern priors. Furthermore, it provides strong theoretical
guarantees: unbiased estimation of the attention matrix and uniform
convergence. It is also backwards-compatible with pre-trained regular
Transformers. We demonstrate its effectiveness on the challenging task of
protein sequence modeling and provide detailed theoretical analysis.
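To make the linear-scaling claim concrete, below is a minimal NumPy sketch of random-feature attention in the Performer spirit. One caveat: it uses the positive softmax-kernel features exp(w·x - ||x||^2 / 2), which are the simplest to write down and come from the follow-up FAVOR+ formulation; the FAVOR mechanism in this paper uses trigonometric orthogonal random features instead. The function names (favor_attention, softmax_features, orthogonal_gaussian) are illustrative, not from the authors' code.

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    """Stack of m Gaussian directions made block-orthogonal via QR,
    re-scaled so each row keeps the norm of an i.i.d. Gaussian vector."""
    blocks = []
    for _ in range(-(-m // d)):                       # ceil(m / d) blocks
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        blocks.append(q)
    w = np.concatenate(blocks, axis=0)[:m]            # (m, d)
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    return w * norms[:, None]

def softmax_features(x, w):
    """Positive random features with E[phi(q) @ phi(k)] = exp(q @ k).
    (FAVOR+ variant; the paper's FAVOR uses sin/cos features instead.)"""
    proj = x @ w.T                                    # (n, m)
    sq = 0.5 * np.sum(x * x, axis=-1, keepdims=True)  # ||x||^2 / 2
    return np.exp(proj - sq) / np.sqrt(w.shape[0])

def favor_attention(Q, K, V, m=256, seed=0):
    """Approximate softmax attention in O(n*m*d) time, never forming n x n."""
    n, d = Q.shape
    w = orthogonal_gaussian(m, d, np.random.default_rng(seed))
    scale = d ** -0.25                                # fold 1/sqrt(d) into Q and K
    q_f = softmax_features(Q * scale, w)              # (n, m)
    k_f = softmax_features(K * scale, w)              # (n, m)
    kv = k_f.T @ V                                    # (m, d)
    z = q_f @ k_f.sum(axis=0)                         # (n,) row normalizers
    return (q_f @ kv) / z[:, None]
```

Because attention is computed as phi(Q) (phi(K)^T V) with per-row normalization phi(Q) (phi(K)^T 1), the only intermediates are n x m and m x d matrices, which is where the linear scaling in sequence length n comes from; the QR step makes the random directions orthogonal, which the paper shows reduces the variance of the kernel estimate.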
Related papers
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z) - Dynamical Mean-Field Theory of Self-Attention Neural Networks [0.0]
Transformer-based models have demonstrated exceptional performance across diverse domains.
Little is known about how they operate or what their expected dynamics are.
We use methods developed for the study of asymmetric Hopfield networks in nonequilibrium regimes.
arXiv Detail & Related papers (2024-06-11T13:29:34Z) - Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies, and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregate representations at the sub-series level.
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation (a toy sketch of the lagged cross-covariance computation appears after this list).
arXiv Detail & Related papers (2023-11-20T17:35:44Z) - In-Context Convergence of Transformers [63.04956160537308]
We study the learning dynamics of a one-layer transformer with softmax attention trained via gradient descent.
For data with imbalanced features, we show that the learning dynamics take a stage-wise convergence process.
arXiv Detail & Related papers (2023-10-08T17:55:33Z) - DIFFormer: Scalable (Graph) Transformers Induced by Energy Constrained
Diffusion [66.21290235237808]
We introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states.
We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs.
Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks.
arXiv Detail & Related papers (2023-01-23T15:18:54Z) - Inductive Biases and Variable Creation in Self-Attention Mechanisms [25.79946667926312]
This work provides a theoretical analysis of the inductive biases of self-attention modules.
Our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent.
Our main result shows that bounded-norm Transformer layers create sparse variables.
arXiv Detail & Related papers (2021-10-19T16:36:19Z) - Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those obtained by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
arXiv Detail & Related papers (2020-02-16T17:16:31Z)