Tranception: protein fitness prediction with autoregressive transformers
and inference-time retrieval
- URL: http://arxiv.org/abs/2205.13760v1
- Date: Fri, 27 May 2022 04:51:15 GMT
- Authors: Pascal Notin, Mafalda Dias, Jonathan Frazer, Javier Marchena-Hurtado,
Aidan Gomez, Debora S. Marks, Yarin Gal
- Abstract summary: The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications.
Deep generative models of protein sequences trained on multiple sequence alignments have been the most successful approaches so far to address these tasks.
Large language models trained on massive quantities of non-aligned protein sequences from diverse families avoid the need for alignments and show potential to eventually bridge the performance gap.
- Score: 23.49976148784686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to accurately model the fitness landscape of protein sequences is
critical to a wide range of applications, from quantifying the effects of human
variants on disease likelihood, to predicting immune-escape mutations in
viruses and designing novel biotherapeutic proteins. Deep generative models of
protein sequences trained on multiple sequence alignments have been the most
successful approaches so far to address these tasks. The performance of these
methods is however contingent on the availability of sufficiently deep and
diverse alignments for reliable training. Their potential scope is thus limited
by the fact many protein families are hard, if not impossible, to align. Large
language models trained on massive quantities of non-aligned protein sequences
from diverse families address these problems and show potential to eventually
bridge the performance gap. We introduce Tranception, a novel transformer
architecture leveraging autoregressive predictions and retrieval of homologous
sequences at inference to achieve state-of-the-art fitness prediction
performance. Given its markedly higher performance on multiple mutants,
robustness to shallow alignments and ability to score indels, our approach
offers significant gain of scope over existing approaches. To enable more
rigorous model testing across a broader range of protein families, we develop
ProteinGym -- an extensive set of multiplexed assays of variant effects,
substantially increasing both the number and diversity of assays compared to
existing benchmarks.
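The abstract names two ingredients, autoregressive predictions and retrieval of homologous sequences at inference, without giving formulas. The following is a minimal sketch of how such a combination could score a substitution mutant, not the authors' implementation. Everything in it is an assumption made for illustration: the bigram model stands in for a trained autoregressive transformer, the retrieval prior is a smoothed per-column amino-acid frequency table from a gap-free toy alignment, the two log-probabilities are mixed with a hypothetical weight alpha, and fitness is taken as the log-likelihood ratio of mutant to wild type.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fit_bigram_model(sequences, pseudocount=1.0):
    """Toy stand-in for an autoregressive model: P(x_i | x_{i-1}) from smoothed bigram counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        prev = "^"  # start-of-sequence token (hypothetical convention)
        for aa in seq:
            counts[prev][aa] += 1
            prev = aa

    def log_prob(prev, aa):
        total = sum(counts[prev].values()) + pseudocount * len(AMINO_ACIDS)
        return math.log((counts[prev][aa] + pseudocount) / total)

    return log_prob


def fit_position_frequencies(aligned_homologs, pseudocount=1.0):
    """Retrieval prior: smoothed per-column log-frequencies (assumes equal-length, gap-free rows)."""
    length = len(aligned_homologs[0])
    total = len(aligned_homologs) + pseudocount * len(AMINO_ACIDS)
    columns = []
    for i in range(length):
        column = Counter(seq[i] for seq in aligned_homologs)
        columns.append(
            {aa: math.log((column[aa] + pseudocount) / total) for aa in AMINO_ACIDS}
        )
    return columns


def sequence_log_likelihood(seq, ar_log_prob, retrieval_log_probs, alpha=0.6):
    """Sum over positions of a weighted mix of autoregressive and retrieval log-probs.

    alpha is an illustrative mixing weight, not a value taken from the text above.
    """
    total = 0.0
    prev = "^"
    for i, aa in enumerate(seq):
        total += (1 - alpha) * ar_log_prob(prev, aa) + alpha * retrieval_log_probs[i][aa]
        prev = aa
    return total


def fitness_score(mutant, wild_type, ar_log_prob, retrieval_log_probs, alpha=0.6):
    """Log-likelihood ratio of mutant to wild type; higher means predicted fitter."""
    return (
        sequence_log_likelihood(mutant, ar_log_prob, retrieval_log_probs, alpha)
        - sequence_log_likelihood(wild_type, ar_log_prob, retrieval_log_probs, alpha)
    )


# Made-up toy family; real use would retrieve homologs for the protein of interest.
homologs = ["MKVLA", "MKILA", "MRVLA", "MKVLG"]
wild_type = "MKVLA"
ar = fit_bigram_model(homologs)
retrieval = fit_position_frequencies(homologs)

for mutant in ("MKILA", "MKWLA"):
    print(mutant, fitness_score(mutant, wild_type, ar, retrieval))
```

In this toy family, isoleucine appears at the third position of one homolog while tryptophan never does, so the sketch ranks MKILA above MKWLA. Because the retrieval table is indexed by alignment column, this version handles substitutions only; scoring indels would need a different retrieval scheme.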
Related papers
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
(2024-10-31T15:22:03Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
(2024-05-16T03:53:21Z)
- Protein Conformation Generation via Force-Guided SE(3) Diffusion Models [48.48934625235448]
Deep generative modeling techniques have been employed to generate novel protein conformations.
We propose a force-guided SE(3) diffusion model, ConfDiff, for protein conformation generation.
(2024-03-21T02:44:08Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [74.64101864289572]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
(2024-01-11T15:03:17Z)
- Efficiently Predicting Protein Stability Changes Upon Single-point Mutation with Large Language Models [51.57843608615827]
The ability to precisely predict protein thermostability is pivotal for various subfields and applications in biochemistry.
We introduce an ESM-assisted efficient approach that integrates protein sequence and structural features to predict the thermostability changes in protein upon single-point mutations.
(2023-12-07T03:25:49Z)
- Multi-level Protein Representation Learning for Blind Mutational Effect Prediction [5.207307163958806]
This paper introduces a novel pre-training framework that cascades sequential and geometric analyzers for protein structures.
It guides mutational directions toward desired traits by simulating natural selection on wild-type proteins.
We assess the proposed approach using a public database and two new databases for a variety of variant effect prediction tasks.
(2023-06-08T03:00:50Z)
- Accurate and Definite Mutational Effect Prediction with Lightweight Equivariant Graph Neural Networks [2.381587712372268]
This research introduces a lightweight graph representation learning scheme that efficiently analyzes the microenvironment of wild-type proteins.
Our solution offers a wide range of benefits that make it an ideal choice for the community.
(2023-04-13T09:51:49Z)
- Plug & Play Directed Evolution of Proteins with Gradient-based Discrete MCMC [1.0499611180329804]
A long-standing goal of machine-learning-based protein engineering is to accelerate the discovery of novel mutations.
We introduce a sampling framework for evolving proteins in silico that supports mixing and matching a variety of unsupervised models.
By composing these models, we aim to improve our ability to evaluate unseen mutations and constrain search to regions of sequence space likely to contain functional proteins.
(2022-12-20T00:26:23Z)
- Few Shot Protein Generation [4.7210697296108926]
We present the MSA-to-protein transformer, a generative model of protein sequences conditioned on protein families represented by multiple sequence alignments (MSAs)
Unlike existing approaches to learning generative models of protein families, the MSA-to-protein transformer conditions sequence generation directly on a learned encoding of the multiple sequence alignment.
Our generative approach accurately models epistasis and indels and allows for exact inference and efficient sampling unlike other approaches.
(2022-04-03T22:14:02Z)
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models [53.17320541056843]
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
(2021-05-11T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.