PEvoLM: Protein Sequence Evolutionary Information Language Model
- URL: http://arxiv.org/abs/2308.08578v1
- Date: Wed, 16 Aug 2023 06:46:28 GMT
- Title: PEvoLM: Protein Sequence Evolutionary Information Language Model
- Authors: Issar Arab
- Abstract summary: A protein sequence is a collection of contiguous tokens or characters called amino acids (AAs).
This research presents an Embedding Language Model (ELMo), converting a protein sequence to a numerical vector representation.
The model was trained not only on predicting the next AA but also on the probability distribution of the next AA derived from similar, yet different sequences.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the exponential increase of the protein sequence databases over time,
multiple-sequence alignment (MSA) methods, like PSI-BLAST, perform exhaustive
and time-consuming database searches to retrieve evolutionary information. The
resulting position-specific scoring matrices (PSSMs) of such search engines
represent a crucial input to many machine learning (ML) models in the field of
bioinformatics and computational biology. A protein sequence is a collection of
contiguous tokens or characters called amino acids (AAs). The analogy to
natural language allowed us to exploit the recent advancements in the field of
Natural Language Processing (NLP) and therefore transfer NLP state-of-the-art
algorithms to bioinformatics. This research presents an Embedding Language
Model (ELMo), converting a protein sequence to a numerical vector
representation. While the original ELMo trained a 2-layer bidirectional Long
Short-Term Memory (LSTM) network with a two-path architecture, one path for the
forward pass and one for the backward pass, this work merges the idea of PSSMs
with the concept of transfer learning and introduces a novel bidirectional
language model (bi-LM) that has four times fewer free parameters and uses a
single path for both passes. The model was trained simultaneously, in a
multi-task setting, not only to predict the next AA but also to predict the
probability distribution of the next AA derived from similar, yet different,
sequences as summarized in a PSSM, hence also learning the evolutionary
information of protein sequences. The network architecture and the pre-trained
model are made available as open source under the permissive MIT license on
GitHub at https://github.com/issararab/PEvoLM.
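
As a rough illustration of the multi-task objective described in the abstract, the sketch below (PyTorch-style; not the authors' implementation, and the layer sizes, the 50/50 loss weighting, and the use of KL divergence for the PSSM head are assumptions) runs one shared LSTM over a sequence and its reverse, with one head trained on next-AA cross-entropy and another trained to match the next position's PSSM column:

```python
# Minimal sketch of the multi-task bi-LM idea (assumed PyTorch-style; not the
# authors' code). Dimensions, loss weighting, and the KL formulation for the
# PSSM head are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_AAS = 20  # standard amino-acid alphabet


class SharedPathBiLM(nn.Module):
    def __init__(self, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(NUM_AAS, embed_dim)
        # One 2-layer LSTM shared by the forward and backward passes,
        # instead of two direction-specific paths as in the original ELMo.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)
        self.next_aa_head = nn.Linear(hidden_dim, NUM_AAS)    # task 1: next AA
        self.next_pssm_head = nn.Linear(hidden_dim, NUM_AAS)  # task 2: next PSSM column

    def forward(self, aa_ids):
        h, _ = self.lstm(self.embed(aa_ids))  # (batch, length, hidden)
        return self.next_aa_head(h), self.next_pssm_head(h)


def direction_loss(model, aa_ids, pssm, alpha=0.5):
    """One direction of the objective.
    aa_ids: (batch, L) integer-encoded AAs; pssm: (batch, L, 20) row-normalized profile."""
    aa_logits, pssm_logits = model(aa_ids[:, :-1])  # predict position t+1 from the prefix
    # Task 1: cross-entropy against the true next amino acid.
    ce = F.cross_entropy(aa_logits.reshape(-1, NUM_AAS), aa_ids[:, 1:].reshape(-1))
    # Task 2: match the evolutionary profile (PSSM column) of the next position.
    kl = F.kl_div(F.log_softmax(pssm_logits, dim=-1), pssm[:, 1:], reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kl


def bilm_loss(model, aa_ids, pssm):
    # The same network processes the sequence and its reverse
    # ("a single path for both passes"), and the two direction losses are summed.
    return direction_loss(model, aa_ids, pssm) + \
           direction_loss(model, aa_ids.flip(1), pssm.flip(1))
```

The exact architecture and training details differ from this sketch; they are available, together with the pre-trained weights, in the GitHub repository linked above.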
Related papers
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study ICL through the lens of a new family of model problems we term in context language learning (ICLL)
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z) - Reprogramming Pretrained Language Models for Protein Sequence
Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve data efficiency, by up to $10^5$ times, over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z) - Multilingual Sequence-to-Sequence Models for Hebrew NLP [16.010560946005473]
We show that sequence-to-sequence generative architectures are more suitable for morphologically rich languages (MRLs) such as Hebrew.
We demonstrate that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can leverage powerful multilingual, pretrained sequence-to-sequence models such as mT5.
arXiv Detail & Related papers (2022-12-19T18:10:23Z) - Protein language models trained on multiple sequence alignments learn
phylogenetic relationships [0.5639904484784126]
Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction.
We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs.
arXiv Detail & Related papers (2022-03-29T12:07:45Z) - Pre-training Co-evolutionary Protein Representation via A Pairwise
Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z) - Align-gram : Rethinking the Skip-gram Model for Protein Sequence
Analysis [0.8733639720576208]
We propose a novel embedding scheme, Align-gram, which is capable of mapping the similar $k$-mers close to each other in a vector space.
Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram for different types of deep learning applications in protein sequence analysis.
arXiv Detail & Related papers (2020-12-06T17:04:17Z) - Pre-training Protein Language Models with Label-Agnostic Binding Pairs
Enhances Performance in Downstream Tasks [1.452875650827562]
Less than 1% of protein sequences are structurally and functionally annotated.
We present a modification to the RoBERTa model by inputting a mixture of binding and non-binding protein sequences.
We suggest that Transformer's attention mechanism contributes to protein binding site discovery.
arXiv Detail & Related papers (2020-12-05T17:37:41Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques use the source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z) - Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)