PCD2Vec: A Poisson Correction Distance-Based Approach for Viral Host
Classification
- URL: http://arxiv.org/abs/2304.06731v1
- Date: Thu, 13 Apr 2023 03:02:22 GMT
- Title: PCD2Vec: A Poisson Correction Distance-Based Approach for Viral Host
Classification
- Authors: Sarwan Ali, Taslim Murad, Murray Patterson
- Abstract summary: Coronaviruses are membrane-enveloped, non-segmented positive-strand RNA viruses belonging to the Coronaviridae family.
In the Coronavirus genome, an essential structural region is the spike region, and it's responsible for attaching the virus to the host cell membrane.
We propose a novel method for predicting the host specificity of coronaviruses by analyzing spike protein sequences from different viral subgenera and species.
- Score: 0.966840768820136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coronaviruses are membrane-enveloped, non-segmented positive-strand RNA
viruses belonging to the Coronaviridae family. Various animal species, mainly
mammalian and avian, are severely infected by various coronaviruses, causing
serious concerns like the recent pandemic (COVID-19). Therefore, building a
deeper understanding of these viruses is essential to devise prevention and
mitigation mechanisms. In the Coronavirus genome, an essential structural
region is the spike region, and it's responsible for attaching the virus to the
host cell membrane. Therefore, the usage of only the spike protein, instead of
the full genome, provides most of the essential information for performing
analyses such as host classification. In this paper, we propose a novel method
for predicting the host specificity of coronaviruses by analyzing spike protein
sequences from different viral subgenera and species. Our method involves using
the Poisson correction distance to generate a distance matrix, followed by
using a radial basis function (RBF) kernel and kernel principal component
analysis (PCA) to generate a low-dimensional embedding. Finally, we apply
classification algorithms to the low-dimensional embedding to generate the
resulting predictions of the host specificity of coronaviruses. We provide
theoretical proofs for the non-negativity, symmetry, and triangle inequality
properties of the Poisson correction distance metric, which are important
properties in a machine-learning setting. By encoding the spike protein
structure and sequences using this comprehensive approach, we aim to uncover
hidden patterns in the biological sequences to make accurate predictions about
host specificity. Finally, our classification results illustrate that our
method can achieve higher predictive accuracy and improve performance over
existing baselines.
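The pipeline described in the abstract (Poisson correction distance, then an RBF kernel, then kernel PCA) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the sequences are already aligned to equal length, uses the standard definition d = -ln(1 - p) with p the proportion of differing sites, and picks an arbitrary kernel width `sigma`. The function names and toy sequences are hypothetical, and the eigenvectors are found with a simple power iteration to keep the example dependency-free.

```python
import math
import random


def poisson_correction_distance(a, b):
    """Poisson correction distance between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # p: proportion of aligned sites at which the residues differ
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    if p == 0:
        return 0.0
    if p >= 1.0:
        return float("inf")  # the correction is undefined when every site differs
    return -math.log(1.0 - p)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise Poisson correction distances."""
    n = len(seqs)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = poisson_correction_distance(seqs[i], seqs[j])
    return D


def rbf_kernel(D, sigma=1.0):
    """RBF kernel exp(-d^2 / (2 sigma^2)); sigma is an assumed, untuned width."""
    return [[math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in row] for row in D]


def center_kernel(K):
    """Double-center a symmetric kernel matrix (centering in feature space)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
            for i in range(n)]


def kernel_pca(K, n_components=2, n_iter=500, seed=0):
    """Embed the training points using the leading eigenvectors of the centered kernel."""
    Kc = center_kernel(K)
    n = len(Kc)
    rng = random.Random(seed)
    columns = []
    for _ in range(n_components):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(n_iter):  # power iteration for the dominant eigenvector
            w = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
        # Rayleigh quotient gives the eigenvalue estimate for the unit vector v
        Kv = [sum(Kc[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * Kv[i] for i in range(n))
        columns.append([math.sqrt(max(lam, 0.0)) * x for x in v])
        # Deflate so the next pass finds the next eigenvector
        for i in range(n):
            for j in range(n):
                Kc[i][j] -= lam * v[i] * v[j]
    return [[col[i] for col in columns] for i in range(n)]


# Toy, made-up aligned fragments (not real spike sequences)
sequences = ["MFVFLVLLPL", "MFVFLVLLPS", "MKVFLALLPL", "MFIFLLFLTL"]
D = distance_matrix(sequences)
embedding = kernel_pca(rbf_kernel(D, sigma=1.0), n_components=2)
for seq, point in zip(sequences, embedding):
    print(seq, [round(x, 4) for x in point])
```

Each row of `embedding` is a low-dimensional feature vector for one sequence. These rows could then be passed to any standard classifier, which corresponds to the final step the abstract describes.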
Related papers
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
- Virus2Vec: Viral Sequence Classification Using Machine Learning [48.40285316053593]
We propose Virus2Vec, a feature-vector representation for viral sequences that enables machine learning models to identify viral hosts.
We empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and rabies virus sequence data to predict the host.
Our results demonstrate that Virus2Vec outperforms the predictive accuracies of baseline and state-of-the-art methods.
arXiv Detail & Related papers (2023-04-24T08:17:16Z)
- ViralVectors: Compact and Scalable Alignment-free Virome Feature Generation [0.7874708385247353]
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than for any other virus.
We propose signatures, a compact vector generation from virome sequencing data that allows effective downstream analysis.
arXiv Detail & Related papers (2023-04-06T06:46:17Z)
- Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model [0.0]
SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans.
It can adapt to different hosts and evolve into different lineages.
It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein.
arXiv Detail & Related papers (2022-11-19T00:34:02Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Dive into Machine Learning Algorithms for Influenza Virus Host Prediction with Hemagglutinin Sequences [4.289396744209968]
Influenza viruses mutate rapidly and can pose a threat to public health, especially to those in vulnerable groups.
Recently, there has been increasing interest in using machine learning algorithms to provide fast and accurate predictions for viral sequences.
In this study, real testing data sets and a variety of evaluation metrics were used to evaluate machine learning algorithms at different taxonomic levels.
arXiv Detail & Related papers (2022-07-28T00:54:54Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids that play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
- Transfer Learning for Protein Structure Classification at Low Resolution [124.5573289131546]
We show that it is possible to make accurate ($\geq$80%) predictions of protein class and architecture from structures determined at low ($\leq$3Å) resolution.
We provide proof of concept for high-speed, low-cost protein structure classification at low resolution, and a basis for extension to prediction of function.
arXiv Detail & Related papers (2020-08-11T15:01:32Z)
- Statistical Linear Models in Virus Genomic Alignment-free Classification: Application to Hepatitis C Viruses [2.900522306460408]
This study explores the power of linear classifiers in genotyping and subtyping partial and complete genomes.
It is applied to the Hepatitis C viruses (HCV).
Overall, several classifiers perform well given a precise combination of the experimental variables.
arXiv Detail & Related papers (2019-10-11T21:40:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.