Linguistically inspired roadmap for building biologically reliable
protein language models
- URL: http://arxiv.org/abs/2207.00982v2
- Date: Fri, 28 Apr 2023 15:33:39 GMT
- Title: Linguistically inspired roadmap for building biologically reliable
protein language models
- Authors: Mai Ha Vu, Rahmad Akbar, Philippe A. Robert, Bartlomiej Swiatczak,
Victor Greiff, Geir Kjetil Sandve, Dag Trygve Truslew Haug
- Abstract summary: We argue that guidance drawn from linguistics can aid with building more interpretable protein LMs.
We provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Deep neural-network-based language models (LMs) are increasingly applied to
large-scale protein sequence data to predict protein function. However, being
largely black-box models and thus challenging to interpret, current protein LM
approaches do not contribute to a fundamental understanding of
sequence-function mappings, hindering rule-based biotherapeutic drug
development. We argue that guidance drawn from linguistics, a field specialized
in analytical rule extraction from natural language data, can aid with building
more interpretable protein LMs that are more likely to learn relevant
domain-specific rules. Differences between protein sequence data and linguistic
sequence data require the integration of more domain-specific knowledge in
protein LMs compared to natural language LMs. Here, we provide a
linguistics-based roadmap for protein LM pipeline choices with regard to
training data, tokenization, token embedding, sequence embedding, and model
interpretation. Incorporating linguistic ideas into protein LMs enables the
development of next-generation interpretable machine-learning models with the
potential of uncovering the biological mechanisms underlying sequence-function
relationships.
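To make the roadmap's pipeline choices concrete, the sketch below illustrates two of them (tokenization and token/sequence embedding) for a protein sequence; the helper names, vocabulary, and embedding sizes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of two protein tokenization choices: single-residue tokens
# vs. overlapping k-mer tokens, followed by a toy token/sequence embedding.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

def residue_tokens(seq: str) -> list[str]:
    """One token per amino acid (finest granularity)."""
    return list(seq)

def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Overlapping k-mers, analogous to subword units in natural language."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_vocab(token_lists) -> dict[str, int]:
    """Map every observed token to an integer id."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

seq = "MKTAYIAKQR"                 # toy example sequence
tokens = kmer_tokens(seq, k=3)
vocab = build_vocab([tokens])

# Token embedding: each token id is mapped to a learned dense vector.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)
ids = torch.tensor([vocab[t] for t in tokens])
token_vectors = embedding(ids)      # shape: (num_tokens, 16)

# A crude sequence embedding: mean-pool the token vectors.
sequence_vector = token_vectors.mean(dim=0)
print(tokens)
print(token_vectors.shape, sequence_vector.shape)
```

The choice of token granularity (residues, k-mers, or learned subwords) and of how token vectors are pooled into a sequence representation are exactly the kinds of decisions the roadmap argues should be informed by domain knowledge rather than defaulted from natural language practice.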
Related papers
- Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose a novel framework ProLLM that employs an LLM tailored for PPI for the first time.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
- ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training [82.37346937497136]
We propose a versatile cross-modal large language model (LLM) for both protein-centric and protein-language tasks.
ProtLLM features a unique dynamic protein mounting mechanism, enabling it to handle complex inputs.
By developing a specialized protein vocabulary, we equip the model with the capability to predict not just natural language but also proteins from a vast pool of candidates.
arXiv Detail & Related papers (2024-02-28T01:29:55Z)
- Endowing Protein Language Models with Structural Knowledge [5.587293092389789]
We introduce a novel framework that enhances protein language models by integrating protein structural data.
The refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database.
PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction.
arXiv Detail & Related papers (2024-01-26T12:47:54Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- InstructProtein: Aligning Human and Protein Language via Knowledge Instruction [38.46621806898224]
Large Language Models (LLMs) have revolutionized the field of natural language processing, but they fall short in comprehending biological sequences such as proteins.
We propose InstructProtein, which possesses bidirectional generation capabilities in both human and protein languages.
InstructProtein serves as a pioneering step towards text-based protein function prediction and sequence design.
arXiv Detail & Related papers (2023-10-05T02:45:39Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Structure-aware Protein Self-supervised Learning [50.04673179816619]
We propose a novel structure-aware protein self-supervised learning method to capture structural information of proteins.
In particular, a well-designed graph neural network (GNN) model is pretrained to preserve the protein structural information.
We identify the relation between the sequential information in the protein language model and the structural information in the specially designed GNN model via a novel pseudo bi-level optimization scheme.
arXiv Detail & Related papers (2022-04-06T02:18:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.