Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis
- URL: http://arxiv.org/abs/2012.03324v1
- Date: Sun, 6 Dec 2020 17:04:17 GMT
- Title: Align-gram : Rethinking the Skip-gram Model for Protein Sequence Analysis
- Authors: Nabil Ibtehaz, S. M. Shakhawat Hossain Sourav, Md. Shamsuzzoha Bayzid, M. Sohel Rahman
- Abstract summary: We propose a novel embedding scheme, Align-gram, which is capable of mapping similar $k$-mers close to each other in a vector space. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGoPlus show the potential of Align-gram for different types of deep learning applications in protein sequence analysis.
- Score: 0.8733639720576208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: The advent of next-generation sequencing technologies has
exponentially increased the volume of biological sequence data. Protein
sequences, often quoted as the `language of life', have been analyzed for a
multitude of applications and inferences.
Motivation: Owing to the rapid development of deep learning, recent years have
seen a number of breakthroughs in the domain of Natural Language Processing.
Since these methods can perform a variety of tasks when trained with a
sufficient amount of data, off-the-shelf models are often repurposed for
biological applications. In this study, we investigated the applicability of
the popular Skip-gram model to protein sequence analysis and attempted to
incorporate some biological insight into it.
Results: We propose a novel $k$-mer embedding scheme, Align-gram, which is
capable of mapping similar $k$-mers close to each other in a vector space.
Furthermore, we experiment with other sequence-based protein representations
and observe that the embeddings derived from Align-gram aid the modeling and
training of deep learning models. Our experiments with a simple baseline
LSTM model and the much more complex CNN model of DeepGoPlus show the
potential of Align-gram for different types of deep learning applications in
protein sequence analysis.
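For context, the following is a minimal sketch of the baseline idea that Align-gram rethinks: tokenizing protein sequences into overlapping $k$-mers and training a vanilla Skip-gram model on them. This is not the authors' Align-gram implementation; it assumes gensim >= 4.0 is available, and the toy sequences and parameter values are illustrative only.

```python
# Minimal sketch: Skip-gram embeddings over protein k-mers.
# NOT the authors' Align-gram code; an illustration of the baseline
# Skip-gram-on-k-mers idea. Assumes gensim >= 4.0 (pip install gensim).
from gensim.models import Word2Vec

def kmerize(sequence, k=3):
    """Split a protein sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Hypothetical toy corpus; real training would use a large protein database.
sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERS",
]
corpus = [kmerize(s, k=3) for s in sequences]

# sg=1 selects the Skip-gram architecture; window is the context size,
# measured in k-mers on either side of the target k-mer.
model = Word2Vec(corpus, vector_size=100, window=5, sg=1, min_count=1)

# Nearest neighbours of a k-mer in the learned vector space.
print(model.wv.most_similar(corpus[0][0], topn=3))
```

Note that vanilla Skip-gram places $k$-mers near each other only when they occur in similar contexts; Align-gram's contribution, per the abstract, is to make proximity in the vector space reflect $k$-mer similarity itself.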
Related papers
- Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification [53.488387420073536]
Life-Code is a comprehensive framework that spans different biological functions.
Life-Code achieves state-of-the-art performance on various tasks across three omics.
arXiv Detail & Related papers (2025-02-11T06:53:59Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- ProtGO: A Transformer based Fusion Model for accurately predicting Gene Ontology (GO) Terms from full scale Protein Sequences [0.11049608786515838]
We propose a transformer-based fusion model capable of predicting Gene Ontology terms from full-scale protein sequences.
The model is able to capture both short- and long-term dependencies within the enzyme's structure and can precisely identify the motifs associated with the various GO terms.
arXiv Detail & Related papers (2024-12-08T02:09:45Z)
- Modeling Multi-Step Scientific Processes with Graph Transformer Networks [0.0]
The viability of geometric learning for regression tasks was benchmarked against a collection of linear models.
A graph transformer network outperformed all tested linear models in scenarios featuring hidden interactions between process steps and sequence-dependent features.
arXiv Detail & Related papers (2024-08-10T04:03:51Z)
- Target-aware Variational Auto-encoders for Ligand Generation with Multimodal Protein Representation Learning [2.01243755755303]
We introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets.
This is the first effort to unify different representations of proteins into a single model, named the Protein Multimodal Network (PMN).
arXiv Detail & Related papers (2023-08-02T12:08:17Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Improving RNA Secondary Structure Design using Deep Reinforcement Learning [69.63971634605797]
We propose a new benchmark for applying reinforcement learning to RNA sequence design, in which the objective function is defined as the free energy of the sequence's secondary structure.
We present an ablation analysis of these algorithms, along with graphs of their performance across batches.
arXiv Detail & Related papers (2021-11-05T02:54:06Z)
- Modeling Protein Using Large-scale Pretrain Language Model [12.568452480689578]
Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets.
Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences.
Our model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences.
arXiv Detail & Related papers (2021-08-17T04:13:11Z)
- Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a deep neural network with human-powered abstraction at the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
- Interpretable Structured Learning with Sparse Gated Sequence Encoder for Protein-Protein Interaction Prediction [2.9488233765621295]
Predicting protein-protein interactions (PPIs) by learning informative representations from amino acid sequences is a challenging yet important problem in biology.
We present a novel deep framework to model and predict PPIs from sequence alone.
Our model incorporates a bidirectional gated recurrent unit to learn sequence representations by leveraging contextualized and sequential information from sequences.
arXiv Detail & Related papers (2020-10-16T17:13:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.