Transformers trained on proteins can learn to attend to Euclidean distance
- URL: http://arxiv.org/abs/2502.01533v1
- Date: Mon, 03 Feb 2025 17:12:44 GMT
- Title: Transformers trained on proteins can learn to attend to Euclidean distance
- Authors: Isaac Ellmen, Constantin Schneider, Matthew I. J. Raybould, Charlotte M. Deane
- Abstract summary: We show that Transformers can function independently as structure models when passed linear embeddings of coordinates. We also show that pre-training protein Transformer encoders with structure improves performance on a downstream task.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While conventional Transformers generally operate on sequence data, they can be used in conjunction with structure models, typically SE(3)-invariant or equivariant graph neural networks (GNNs), for 3D applications such as protein structure modelling. These hybrids typically involve either (1) preprocessing/tokenizing structural features as input for Transformers or (2) taking Transformer embeddings and processing them within a structural representation. However, there is evidence that Transformers can learn to process structural information on their own, such as the AlphaFold3 structural diffusion model. In this work we show that Transformers can function independently as structure models when passed linear embeddings of coordinates. We first provide a theoretical explanation for how Transformers can learn to filter attention as a 3D Gaussian with learned variance. We then validate this theory using both simulated 3D points and in the context of masked token prediction for proteins. Finally, we show that pre-training protein Transformer encoders with structure improves performance on a downstream task, yielding better performance than custom structural models. Together, this work provides a basis for using standard Transformers as hybrid structure-language models.
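For intuition, here is a minimal numpy sketch of the theoretical claim. It is an illustrative construction, not the paper's learned parameterisation: the query/key features are built by hand from the coordinates and their squared norms, and sigma stands in for the learned variance. The point is only that a 3D Gaussian-of-distance filter lies within the family of functions that standard dot-product attention can express over coordinate-derived embeddings.

```python
# Illustrative sketch (assumed construction, not the paper's exact method):
# dot-product attention over features carrying coordinates and squared norms
# reproduces a row-normalised 3D Gaussian of Euclidean distance.
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 8, 2.0                      # number of points, stand-in for the learned std-dev
X = rng.normal(size=(N, 3))            # 3D coordinates, e.g. C-alpha positions

# Hand-built query/key features so that q_i . k_j = -||x_i - x_j||^2 / (2 sigma^2)
Q = np.hstack([2 * X, -np.ones((N, 1)), -np.sum(X**2, axis=1, keepdims=True)]) / (2 * sigma**2)
K = np.hstack([X, np.sum(X**2, axis=1, keepdims=True), np.ones((N, 1))])

scores = Q @ K.T                       # equals -||x_i - x_j||^2 / (2 sigma^2)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)  # softmax over each row

# Reference: an explicit Gaussian kernel on pairwise distances, row-normalised
d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
gauss = np.exp(-d2 / (2 * sigma**2))
gauss /= gauss.sum(axis=1, keepdims=True)

print(np.allclose(attn, gauss))        # True: attention weights follow a 3D Gaussian
```

In the paper's setting the embeddings and the variance are learned rather than hand-built; the sketch only demonstrates the expressivity argument.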
Related papers
- Plain Transformers Can be Powerful Graph Learners [64.50059165186701]
Researchers have attempted to migrate Transformers to graph learning, but most advanced Graph Transformers have strayed far from plain Transformers. This work demonstrates that the plain Transformer architecture can be a powerful graph learner.
arXiv Detail & Related papers (2025-04-17T02:06:50Z)
- Disentangling Feature Structure: A Mathematically Provable Two-Stage Training Dynamics in Transformers [14.775485895006323]
This paper theoretically demonstrates how such two-stage training dynamics occur in transformers.
To the best of our knowledge, this is the first rigorous result regarding a two-stage optimization process in transformers.
arXiv Detail & Related papers (2025-02-28T03:27:24Z)
- (How) Can Transformers Predict Pseudo-Random Numbers? [7.201095605457193]
We study the ability of Transformers to learn pseudo-random number sequences from linear congruential generators (LCGs).
Our analysis reveals that Transformers can perform in-context prediction of LCG sequences with unseen moduli ($m$) and parameters ($a,c$).
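For context, an LCG generates the recurrence $x_{k+1} = (a \cdot x_k + c) \bmod m$; a short Python sketch with arbitrarily chosen illustrative parameters:

```python
# Linear congruential generator: the sequence family the paper above
# trains Transformers to predict in-context. Parameters here are arbitrary.
def lcg(seed: int, a: int, c: int, m: int, n: int) -> list[int]:
    """Return the first n values of x_{k+1} = (a * x_k + c) mod m."""
    xs, x = [], seed
    for _ in range(n):
        x = (a * x + c) % m
        xs.append(x)
    return xs

print(lcg(seed=1, a=5, c=3, m=16, n=8))  # [8, 11, 10, 5, 12, 15, 14, 9]
```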
arXiv Detail & Related papers (2025-02-14T18:59:40Z)
- Converting Transformers into DGNNs Form [7.441691512676916]
We introduce a synthetic unitary digraph convolution based on the digraph Fourier transform. The resulting model, which we term Converter, effectively converts a Transformer into a Directed Graph Neural Network form. We have tested Converter on the Long-Range Arena benchmark, long document classification, and DNA sequence-based taxonomy classification.
arXiv Detail & Related papers (2025-02-01T22:44:46Z)
- Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory [11.3128832831327]
Increasing the size of a Transformer does not always lead to enhanced performance. We present a theoretical framework that sheds light on memorization during the pre-training of transformer-based language models.
arXiv Detail & Related papers (2024-05-14T15:48:36Z)
- On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z)
- Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
TP-Transformer augments the traditional Transformer architecture to include an additional component to represent structure.
The second method imbues structure at the data level by segmenting the data with morphological tokenization.
We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
- Your Transformer May Not be as Powerful as You Expect [88.11364619182773]
We mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions.
We present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is.
We develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions.
arXiv Detail & Related papers (2022-05-26T14:51:30Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
- SIT3: Code Summarization with Structure-Induced Transformer [48.000063280183376]
We propose a novel model based on structure-induced self-attention, which encodes sequential inputs with highly-effective structure modeling.
Our newly-proposed model achieves new state-of-the-art results on popular benchmarks.
arXiv Detail & Related papers (2020-12-29T11:37:43Z)