PWM2Vec: An Efficient Embedding Approach for Viral Host Specification
from Coronavirus Spike Sequences
- URL: http://arxiv.org/abs/2201.02273v1
- Date: Thu, 6 Jan 2022 23:25:54 GMT
- Title: PWM2Vec: An Efficient Embedding Approach for Viral Host Specification
from Coronavirus Spike Sequences
- Authors: Sarwan Ali, Babatunde Bello, Prakash Chourasia, Ria Thazhe Punathil,
Yijing Zhou, Murray Patterson
- Abstract summary: We study the different hosts which can be potential carriers and transmitters of deadly viruses to humans.
In coronaviruses, the surface (S) protein, or spike protein, is an important part of determining host specificity.
We propose a feature embedding based on the well-known position-weight matrix (PWM), which we call PWM2Vec, and use it to generate feature vectors from the spike protein sequences of coronaviruses.
- Score: 0.7340017786387767
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The origin of the COVID-19 pandemic is still unknown and is an important open question. There
are speculations that bats are a possible origin. Likewise, there are many
closely related (corona-) viruses, such as SARS, which was found to be
transmitted through civets. The study of the different hosts which can be
potential carriers and transmitters of deadly viruses to humans is crucial to
understanding, mitigating and preventing current and future pandemics. In
coronaviruses, the surface (S) protein, or spike protein, is an important part
of determining host specificity since it is the point of contact between the
virus and the host cell membrane. In this paper, we classify the hosts of over
five thousand coronaviruses from their spike protein sequences, segregating
them into clusters of distinct hosts among avians, bats, camels, swine, humans
and weasels, to name a few. We propose a feature embedding based on the
well-known position-weight matrix (PWM), which we call PWM2Vec, and use to
generate feature vectors from the spike protein sequences of these
coronaviruses. While our embedding is inspired by the success of PWMs in
biological applications such as determining protein function, or identifying
transcription factor binding sites, we are the first (to the best of our
knowledge) to use PWMs in the context of host classification from viral
sequences to generate a fixed-length feature vector representation. The results
on real-world data show that, using PWM2Vec, we perform comparably to the
baseline models. We also measure the importance
of different amino acids using information gain to show the amino acids which
are important for predicting the host of a given coronavirus.
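The abstract does not spell out how the embedding is computed, so the following is only a minimal sketch of one way a position-weight matrix can turn a protein sequence into a numeric vector. It assumes the sequence is split into overlapping k-mers, a PWM of log-odds weights is built from those k-mers, and each k-mer is then scored against the PWM. The function names, the choice of k=3, the pseudocount and the uniform background are illustrative assumptions, not the authors' settings, and the output has a fixed length only if all input sequences have the same (e.g. aligned) length.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def position_weight_matrix(kmer_list, k, pseudocount=0.1, background=1 / 20):
    """Build a PWM as a dict: amino acid -> list of k log-odds weights.

    pseudocount and background are assumed values chosen for illustration.
    """
    # Position frequency matrix: count of each amino acid at each k-mer position.
    counts = {a: [0] * k for a in AMINO_ACIDS}
    for kmer in kmer_list:
        for pos, a in enumerate(kmer):
            if a in counts:  # skip ambiguous symbols such as 'X'
                counts[a][pos] += 1

    # Total number of counted residues in each column.
    col_totals = [sum(counts[a][pos] for a in AMINO_ACIDS) for pos in range(k)]

    pwm = {}
    for a in AMINO_ACIDS:
        row = []
        for pos in range(k):
            # Smoothed probability of amino acid a at this position.
            prob = (counts[a][pos] + pseudocount) / (
                col_totals[pos] + pseudocount * len(AMINO_ACIDS)
            )
            # Log-odds weight relative to the background frequency.
            row.append(math.log2(prob / background))
        pwm[a] = row
    return pwm


def pwm_embed(seq, k=3):
    """Score every k-mer of seq against the PWM built from seq itself.

    Returns a list of len(seq) - k + 1 floats (one score per k-mer).
    """
    ks = kmers(seq, k)
    pwm = position_weight_matrix(ks, k)
    return [
        sum(pwm[a][pos] for pos, a in enumerate(kmer) if a in pwm)
        for kmer in ks
    ]


if __name__ == "__main__":
    example = "MFVFLVLLPLVSSQCVNLT"  # toy amino-acid string, not a full spike sequence
    vec = pwm_embed(example, k=3)
    print(len(vec))  # one score per k-mer: len(example) - k + 1
```

Vectors produced this way could then be passed to any standard classifier. Building the PWM per sequence rather than over the whole dataset is a design choice of this sketch and may differ from the paper.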
Related papers
- Opponent Shaping for Antibody Development [49.26728828005039]
Anti-viral therapies are typically designed to target only the current strains of a virus.
Therapy-induced selective pressures act on viruses to drive the emergence of mutated strains, against which initial therapies have reduced efficacy.
We build on a computational model of binding between antibodies and viral antigens to implement a genetic simulation of viral evolutionary escape.
arXiv Detail & Related papers (2024-09-16T14:56:27Z)
- Virus2Vec: Viral Sequence Classification Using Machine Learning [48.40285316053593]
We propose Virus2Vec, a feature-vector representation for viral sequences that enables machine learning models to identify viral hosts.
We empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and rabies virus sequence data to predict the host.
Our results demonstrate that Virus2Vec outperforms the predictive accuracies of baseline and state-of-the-art methods.
arXiv Detail & Related papers (2023-04-24T08:17:16Z)
- PCD2Vec: A Poisson Correction Distance-Based Approach for Viral Host
Classification [0.966840768820136]
Coronaviruses are membrane-enveloped, non-segmented positive-strand RNA viruses belonging to the Coronaviridae family.
In the Coronavirus genome, an essential structural region is the spike region, and it's responsible for attaching the virus to the host cell membrane.
We propose a novel method for predicting the host specificity of coronaviruses by analyzing spike protein sequences from different viral subgenera and species.
arXiv Detail & Related papers (2023-04-13T03:02:22Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Domain Invariant Model with Graph Convolutional Network for Mammogram
Classification [49.691629817104925]
We propose a novel framework, namely Domain Invariant Model with Graph Convolutional Network (DIM-GCN).
We first propose a Bayesian network, which explicitly decomposes the latent variables into disease-related and disease-irrelevant parts that are provably disentangled from each other.
To better capture the macroscopic features, we leverage the observed clinical attributes as a goal for reconstruction, via a Graph Convolutional Network (GCN).
arXiv Detail & Related papers (2022-04-21T08:23:44Z)
- Predicting Influenza A Viral Host Using PSSM and Word Embeddings [5.067354030054702]
We use various machine learning models with features derived from the position-specific scoring matrix (PSSM) to infer the origin host of viruses.
The results show that the PSSM-based model reaches an MCC of around 95% and an F1 score of around 96%.
arXiv Detail & Related papers (2022-01-04T14:05:49Z)
- Classification of Influenza Hemagglutinin Protein Sequences using
Convolutional Neural Networks [8.397189036839956]
This paper focuses on accurately predicting if an Influenza type A virus can infect specific hosts, and more specifically, Human, Avian and Swine hosts, using only the protein sequence of the HA gene.
We propose encoding the protein sequences into numerical signals using the Hydrophobicity Index and subsequently utilising a Convolutional Neural Network-based predictive model.
As the results show, the proposed model can distinguish HA protein sequences with high accuracy whenever the virus under investigation can infect Human, Avian or Swine hosts.
arXiv Detail & Related papers (2021-08-09T10:42:26Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.