Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling
- URL: http://arxiv.org/abs/2301.06568v1
- Date: Mon, 16 Jan 2023 19:04:45 GMT
- Title: Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling
- Authors: Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa,
Mohamed Elkerdawy, Charlotte Rochereau, and Burkhard Rost
- Abstract summary: We present Ankh, the first general-purpose protein language model trained on Google's TPU-v4.
Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As opposed to scaling up protein language models (PLMs), we seek to
improve performance via protein-specific optimization. Although the
proportionality between language-model size and the richness of its learned
representations is validated, we prioritize accessibility and pursue a path of
data-efficient, cost-reduced, and knowledge-guided optimization. Through over
twenty experiments spanning masking, architecture, and pre-training data, we
derive insights from protein-specific experimentation into building a model
that optimally interprets the language of life. We present Ankh, the first
general-purpose PLM trained on Google's TPU-v4, which surpasses
state-of-the-art performance with fewer parameters (<10% for pre-training, <7%
for inference, and <30% of the embedding dimension). We provide a
representative range of structure and function benchmarks on which Ankh
excels. We further provide a protein-variant generation analysis at High-N and
One-N input data scales, where Ankh succeeds in learning protein evolutionary
conservation-mutation trends and introducing functional diversity while
retaining key structural-functional characteristics. We dedicate our work to
promoting accessibility to research innovation via attainable resources.
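As a concrete illustration of the "general-purpose" claim, the sketch below extracts per-residue embeddings from a pretrained Ankh checkpoint. It assumes the publicly released `ElnaggarLab/ankh-base` weights on Hugging Face and the `transformers` library; since Ankh is an encoder-decoder, only the encoder is loaded. This is a minimal sketch, not the authors' reference pipeline:

```python
# Minimal sketch: per-residue embeddings from Ankh via Hugging Face
# transformers. The checkpoint name and tokenization details are assumptions
# based on the publicly released "ElnaggarLab/ankh-base" weights.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("ElnaggarLab/ankh-base")
model = T5EncoderModel.from_pretrained("ElnaggarLab/ankh-base").eval()

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQL"  # toy amino-acid sequence
# Ankh tokenizes one residue per token, so split the sequence into characters.
inputs = tokenizer([list(sequence)],
                   is_split_into_words=True,
                   add_special_tokens=True,
                   return_tensors="pt")

with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state

# One vector per residue (plus the end-of-sequence token).
print(embeddings.shape)  # (1, len(sequence) + 1, embedding_dim)
```

Frozen embeddings of this kind are what the paper's structure and function benchmarks consume downstream, typically as input to small task-specific heads.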
Related papers
- Training Compute-Optimal Protein Language Models (arXiv, 2024-11-04)
Most protein language models are trained with extensive compute resources until performance gains plateau.
Our investigation is grounded in a massive dataset consisting of 939 million protein sequences.
We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens.
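To make the study's framing concrete, here is a hypothetical sketch of the kind of parametric scaling-law fit such investigations use, with synthetic data and a simple one-variable form L(N) = E + A·N^(-alpha); the paper's actual functional form and fitted values may differ:

```python
# Hypothetical sketch of a scaling-law fit for a compute-optimal study:
# fit L(N) = E + A * N**(-alpha) to (model size, loss) pairs.
# The data points below are synthetic, not the paper's results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(N, E, A, alpha):
    return E + A * N ** (-alpha)

# Synthetic model sizes spanning 3.5M to 10.7B parameters, with losses
# generated from a known law so the fit recovers it exactly.
N = np.array([3.5e6, 3.5e7, 1.5e8, 6.5e8, 3.0e9, 1.07e10])
loss = power_law(N, 1.69, 120.0, 0.35)

(E, A, alpha), _ = curve_fit(power_law, N, loss, p0=[1.0, 50.0, 0.3])
print(f"irreducible loss E = {E:.2f}, exponent alpha = {alpha:.2f}")
```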
- Training on test proteins improves fitness, structure, and function prediction (arXiv, 2024-11-04)
Self-supervised pre-training on large datasets is a common method to enhance generalization.
We introduce a method for self-supervised fine-tuning at test time, allowing models to adapt to the test protein of interest on the fly.
We show that our method leads to new state-of-the-art results on the standard benchmark for protein fitness prediction.
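A minimal sketch of the general idea, test-time self-supervised adaptation with a masked-token objective on the single test protein; a small public ESM-2 checkpoint serves as a stand-in, and this is not the paper's code:

```python
# Illustrative sketch of test-time self-supervised adaptation: briefly
# fine-tune a masked-LM protein model on the single test protein before
# predicting. The small ESM-2 checkpoint is a stand-in, not the paper's code.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"  # small public checkpoint as a stand-in
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQL"  # the single test protein
batch = tokenizer(sequence, return_tensors="pt")

for step in range(8):  # a few quick adaptation steps on the test protein
    input_ids = batch["input_ids"].clone()
    labels = input_ids.clone()
    # Mask ~15% of residues; only masked positions contribute to the loss.
    special = (input_ids == tokenizer.cls_token_id) | \
              (input_ids == tokenizer.eos_token_id)
    mask = (torch.rand(input_ids.shape) < 0.15) & ~special
    mask[0, 1] = True  # guarantee at least one masked position
    labels[~mask] = -100
    input_ids[mask] = tokenizer.mask_token_id
    loss = model(input_ids=input_ids,
                 attention_mask=batch["attention_mask"],
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
# The adapted model is then used to predict on this same protein.
```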
- Endowing Protein Language Models with Structural Knowledge (arXiv, 2024-01-26)
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein (arXiv, 2024-01-11)
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
- Functional Graphical Models: Structure Enables Offline Data-Driven Optimization (arXiv, 2024-01-08)
We show how structure can enable sample-efficient data-driven optimization.
We also present a data-driven optimization algorithm that infers the functional graphical model (FGM) structure itself.
- Target-aware Variational Auto-encoders for Ligand Generation with Multimodal Protein Representation Learning (arXiv, 2023-08-02)
We introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets.
This is the first effort to unify different representations of proteins into a single model, which we name the Protein Multimodal Network (PMN).
- Structure-informed Language Models Are Protein Designers (arXiv, 2023-02-03)
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs).
We conduct structural surgery on pLMs: a lightweight structural adapter is implanted into the pLM, endowing it with structural awareness (see the sketch below).
Experiments show that our approach outperforms state-of-the-art methods by a large margin.
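As an illustration of what a "lightweight structural adapter" can look like, here is a hypothetical bottleneck module that injects structure features into a frozen pLM's hidden states; the shapes and wiring are assumptions, not LM-Design's exact architecture:

```python
# Illustrative sketch of a lightweight structural adapter: a small bottleneck
# module that adds a structure-conditioned correction to a frozen pLM's
# hidden states. Dimensions and wiring are assumptions for illustration.
import torch
import torch.nn as nn

class StructuralAdapter(nn.Module):
    def __init__(self, d_model: int, d_struct: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model + d_struct, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.GELU()

    def forward(self, hidden: torch.Tensor,
                struct_feats: torch.Tensor) -> torch.Tensor:
        # Residual update: frozen pLM representation plus a small, trainable
        # correction conditioned on per-residue structure features.
        fused = torch.cat([hidden, struct_feats], dim=-1)
        return hidden + self.up(self.act(self.down(fused)))

# Example: batch of 2 proteins, length 50, pLM width 768, 16-dim structure features.
adapter = StructuralAdapter(d_model=768, d_struct=16)
h = torch.randn(2, 50, 768)
s = torch.randn(2, 50, 16)
print(adapter(h, s).shape)  # torch.Size([2, 50, 768])
```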
- Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data (arXiv, 2022-02-06)
We provide the first model-selection results on large pretrained Transformers from Hugging Face using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
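To illustrate a data-free, heavy-tail generalization metric, the sketch below fits a tail exponent to the eigenvalue spectrum of a weight matrix using a Hill estimator; this is one common estimator in heavy-tailed self-regularization analyses, not necessarily the paper's exact procedure:

```python
# Illustrative sketch of a data-free, heavy-tail (HT) metric: estimate the
# power-law tail exponent of a weight matrix's eigenvalue spectrum with a
# Hill estimator. One common choice in HT analyses, not the paper's exact one.
import numpy as np

def hill_alpha(W: np.ndarray, k: int = 50) -> float:
    eigs = np.linalg.svd(W, compute_uv=False) ** 2  # eigenvalues of W^T W
    tail = np.sort(eigs)[::-1][:k]                  # k largest eigenvalues
    return 1.0 + k / np.sum(np.log(tail / tail[-1]))

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072))  # stand-in for a Transformer FFN weight
print(f"estimated tail exponent alpha = {hill_alpha(W):.2f}")
```

Smaller exponents indicate heavier-tailed spectra, which this line of work associates with better-generalizing layers, so the metric can rank models without any training or test data.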
- EBM-Fold: Fully-Differentiable Protein Folding Powered by Energy-based Models (arXiv, 2021-05-11)
We propose a fully-differentiable approach for protein structure optimization, guided by a data-driven generative network.
Our EBM-Fold approach can efficiently produce high-quality decoys, compared against traditional Rosetta-based structure optimization routines.
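A minimal sketch of the underlying idea, gradient-based refinement of coordinates under a learned energy; the energy network here is a randomly initialized stand-in, not EBM-Fold's trained model:

```python
# Illustrative sketch of fully-differentiable, energy-guided refinement:
# score a conformation with a learned energy network (here a randomly
# initialized stand-in, not EBM-Fold's trained model) and descend its
# gradient with respect to the C-alpha coordinates.
import torch
import torch.nn as nn

energy_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
for p in energy_net.parameters():
    p.requires_grad_(False)  # the energy model is fixed; only coords move

coords = torch.randn(64, 3, requires_grad=True)  # initial decoy, 64 residues
optimizer = torch.optim.Adam([coords], lr=1e-2)

for step in range(200):
    # "Energy" of the conformation: sum over all pairwise C-alpha distances.
    dists = torch.cdist(coords, coords).reshape(-1, 1)
    energy = energy_net(dists).sum()
    optimizer.zero_grad()
    energy.backward()
    optimizer.step()

with torch.no_grad():
    final = energy_net(torch.cdist(coords, coords).reshape(-1, 1)).sum()
print(f"energy after refinement: {final.item():.3f}")
```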
- PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction (arXiv, 2020-10-30)
In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank.
We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis.
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.