Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis
- URL: http://arxiv.org/abs/2012.03324v1
- Date: Sun, 6 Dec 2020 17:04:17 GMT
- Title: Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis
- Authors: Nabil Ibtehaz, S. M. Shakhawat Hossain Sourav, Md. Shamsuzzoha Bayzid, M. Sohel Rahman
- Abstract summary: We propose a novel embedding scheme, Align-gram, which is capable of mapping similar $k$-mers close to each other in a vector space.
Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram for different types of deep learning applications in protein sequence analysis.
- Score: 0.8733639720576208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: The advent of next-generation sequencing technologies
has exponentially increased the volume of biological sequence data. Protein
sequences, often quoted as the 'language of life', have been analyzed for a
multitude of applications and inferences.
Motivation: Owing to the rapid development of deep learning, recent years
have seen a number of breakthroughs in the domain of Natural Language
Processing. Since these methods can perform a variety of tasks when trained
with a sufficient amount of data, off-the-shelf models have been used for
various biological applications. In this study, we investigated the
applicability of the popular Skip-gram model to protein sequence analysis and
made an attempt to incorporate some biological insights into it.
Results: We propose a novel $k$-mer embedding scheme, Align-gram, which is
capable of mapping similar $k$-mers close to each other in a vector space.
Furthermore, we experiment with other sequence-based protein representations
and observe that the embeddings derived from Align-gram aid in modeling and
training deep learning models better. Our experiments with a simple baseline
LSTM model and the much more complex CNN model of DeepGoPlus show the
potential of Align-gram for different types of deep learning applications in
protein sequence analysis.
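For readers new to the setup, the sketch below shows how a vanilla Skip-gram model can be trained over protein $k$-mers, i.e. the baseline this paper rethinks. It uses gensim's Word2Vec; the library choice, the toy sequences, and all hyperparameter values are assumptions for illustration, and this is not the Align-gram implementation itself.

```python
# Minimal sketch: a vanilla Skip-gram embedding over protein k-mers.
# This is the baseline setup, NOT Align-gram itself; gensim, the toy
# sequences, and all hyperparameters below are assumptions.
from gensim.models import Word2Vec

def kmerize(sequence: str, k: int = 3) -> list[str]:
    """Split a protein sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Each protein sequence becomes one "sentence" whose "words" are k-mers.
proteins = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSDKIIHLTDDSFDTDVLKA"]
corpus = [kmerize(seq, k=3) for seq in proteins]

model = Word2Vec(
    corpus,
    vector_size=100,  # embedding dimension
    window=5,         # context window over neighbouring k-mers
    min_count=1,      # keep every k-mer in this toy corpus
    sg=1,             # sg=1 selects the Skip-gram architecture
)

# Inspect nearest neighbours of a k-mer in the learned vector space.
print(model.wv.most_similar("KTA", topn=3))
```

Align-gram's contribution is to make such vector-space neighbourhoods reflect biological similarity between $k$-mers rather than co-occurrence alone, presumably via the alignment-based similarity its name suggests.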
Related papers
- Modeling Multi-Step Scientific Processes with Graph Transformer Networks [0.0]
The viability of geometric learning for regression tasks was benchmarked against a collection of linear models.
A graph transformer network outperformed all tested linear models in scenarios that featured hidden interactions between process steps and sequence dependent features.
arXiv Detail & Related papers (2024-08-10T04:03:51Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z)
- Target-aware Variational Auto-encoders for Ligand Generation with Multimodal Protein Representation Learning [2.01243755755303]
We introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets.
This is the first effort to unify different representations of proteins into a single model, which we name the Protein Multimodal Network (PMN).
arXiv Detail & Related papers (2023-08-02T12:08:17Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Dynamic Latent Separation for Deep Learning [67.62190501599176]
A core problem in machine learning is to learn expressive latent variables for model prediction on complex data.
Here, we develop an approach that improves expressiveness, provides partial interpretation, and is not restricted to specific applications.
arXiv Detail & Related papers (2022-10-07T17:56:53Z)
- Improving RNA Secondary Structure Design using Deep Reinforcement Learning [69.63971634605797]
We propose a new benchmark for applying reinforcement learning to RNA sequence design, in which the objective function is defined to be the free energy of the sequence's secondary structure.
We present an ablation analysis of these algorithms, along with graphs of each algorithm's performance across training batches.
arXiv Detail & Related papers (2021-11-05T02:54:06Z)
- Modeling Protein Using Large-scale Pretrain Language Model [12.568452480689578]
Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets.
Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences.
Our model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences.
arXiv Detail & Related papers (2021-08-17T04:13:11Z)
- Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a deep neural network with human-powered abstraction at the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
- Interpretable Structured Learning with Sparse Gated Sequence Encoder for Protein-Protein Interaction Prediction [2.9488233765621295]
Predicting protein-protein interactions (PPIs) by learning informative representations from amino acid sequences is a challenging yet important problem in biology.
We present a novel deep framework to model and predict PPIs from sequence alone.
Our model incorporates a bidirectional gated recurrent unit that learns representations by leveraging contextualized and sequential information from amino acid sequences (a minimal encoder sketch follows this list).
arXiv Detail & Related papers (2020-10-16T17:13:32Z)
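As a concrete illustration of the recurrent encoder named in the last entry above, here is a minimal bidirectional-GRU sequence encoder in PyTorch. The dimensions, amino-acid vocabulary, and mean-pooling step are assumptions for illustration, not the authors' exact PPI model.

```python
# Minimal sketch of a bidirectional-GRU encoder for amino-acid sequences,
# in the spirit of the PPI paper above. All architecture details here
# (dimensions, vocabulary, pooling) are assumptions, not the exact model.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

class BiGRUEncoder(nn.Module):
    def __init__(self, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(len(AMINO_ACIDS), embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len) integer-encoded amino acids
        states, _ = self.gru(self.embed(tokens))  # (batch, seq_len, 2*hidden)
        return states.mean(dim=1)                 # mean-pool to a fixed vector

seq = torch.tensor([[AA_TO_IDX[aa] for aa in "MKTAYIAKQR"]])
print(BiGRUEncoder()(seq).shape)  # torch.Size([1, 256])
```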