Virus2Vec: Viral Sequence Classification Using Machine Learning
- URL: http://arxiv.org/abs/2304.12328v1
- Date: Mon, 24 Apr 2023 08:17:16 GMT
- Title: Virus2Vec: Viral Sequence Classification Using Machine Learning
- Authors: Sarwan Ali, Babatunde Bello, Prakash Chourasia, Ria Thazhe Punathil,
Pin-Yu Chen, Imdad Ullah Khan, Murray Patterson
- Abstract summary: We propose Virus2Vec, a feature-vector representation for viral sequences that enables machine learning models to identify viral hosts.
We empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and rabies virus sequence data to predict the host.
Our results demonstrate that Virus2Vec outperforms the predictive accuracies of baseline and state-of-the-art methods.
- Score: 48.40285316053593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the host-specificity of different families of viruses sheds
light on the origin of, e.g., SARS-CoV-2, rabies, and other such zoonotic
pathogens in humans. It enables epidemiologists, medical professionals, and
policymakers to curb existing epidemics and prevent future ones promptly. In
the family Coronaviridae (of which SARS-CoV-2 is a member), it is well-known
that the spike protein is the point of contact between the virus and the host
cell membrane. On the other hand, the two traditional mammalian orders,
Carnivora (carnivores) and Chiroptera (bats), are recognized to be responsible
for maintaining and spreading the Rabies Lyssavirus (RABV). We propose
Virus2Vec, a feature-vector representation for viral (nucleotide or amino acid)
sequences that enables vector-space-based machine learning models to identify
viral hosts. Virus2Vec generates numerical feature vectors for unaligned
sequences, allowing us to forego the computationally expensive sequence
alignment step from the pipeline. Virus2Vec leverages the power of both the
minimizer and position weight matrix (PWM) to generate compact feature
vectors. Using several classifiers, we empirically evaluate Virus2Vec on
real-world spike sequences of Coronaviridae and rabies virus sequence data to
predict the host (identifying the reservoirs of infection). Our results
demonstrate that Virus2Vec outperforms the predictive accuracies of baseline
and state-of-the-art methods.
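
The abstract names two ingredients, minimizers and a position weight matrix (PWM), without giving details. The following is a minimal Python sketch of how the two could be combined, not the authors' implementation. The function names, the default values of k and m, the uniform background, the pseudocount, and the choice to score each minimizer by summing its PWM entries are all assumptions made for illustration. Here the minimizer of a k-mer is taken to be the lexicographically smallest m-mer found in the k-mer or in its reverse.

```python
import math
from collections import Counter


def minimizer(kmer, m):
    """Return the lexicographically smallest m-mer of kmer or of its reverse."""
    candidates = []
    for s in (kmer, kmer[::-1]):
        candidates.extend(s[i:i + m] for i in range(len(s) - m + 1))
    return min(candidates)


def minimizers(seq, k, m):
    """One minimizer per sliding k-mer window of an unaligned sequence."""
    return [minimizer(seq[i:i + k], m) for i in range(len(seq) - k + 1)]


def pwm(mers, alphabet, pseudocount=1.0):
    """Log-odds PWM over equal-length m-mers (uniform background: an assumption)."""
    background = 1.0 / len(alphabet)
    total = len(mers) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(len(mers[0])):
        counts = Counter(mer[pos] for mer in mers)
        matrix.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in alphabet
        })
    return matrix


def embed(seq, k=9, m=3, alphabet="ACGT"):
    """Score each minimizer by summing its PWM entries (hypothetical scoring rule)."""
    mers = minimizers(seq, k, m)
    matrix = pwm(mers, alphabet)
    # Characters outside the alphabet (e.g. ambiguous bases) contribute 0.
    return [
        sum(matrix[pos].get(ch, 0.0) for pos, ch in enumerate(mer))
        for mer in mers
    ]
```

In this sketch the vector has one entry per k-mer window, so its length depends on the sequence length. A classifier that expects fixed-length inputs would need padding or truncation, which is not shown here.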
Related papers
- Opponent Shaping for Antibody Development [49.26728828005039]
Anti-viral therapies are typically designed to target only the current strains of a virus.
Therapy-induced selective pressures act on viruses to drive the emergence of mutated strains, against which initial therapies have reduced efficacy.
We build on a computational model of binding between antibodies and viral antigens to implement a genetic simulation of viral evolutionary escape.
arXiv Detail & Related papers (2024-09-16T14:56:27Z) - A Conditional Flow Variational Autoencoder for Controllable Synthesis of
Virtual Populations of Anatomy [76.20367415712867]
We propose a conditional variational autoencoder (cVAE) with normalising flows to boost the flexibility and complexity of the approximate posterior learnt.
We demonstrate the performance of our conditional flow VAE using a data set of cardiac left ventricles acquired from 2360 patients.
arXiv Detail & Related papers (2023-06-26T13:23:52Z) - PCD2Vec: A Poisson Correction Distance-Based Approach for Viral Host
Classification [0.966840768820136]
Coronaviruses are membrane-enveloped, non-segmented positive-strand RNA viruses belonging to the Coronaviridae family.
In the Coronavirus genome, an essential structural region is the spike region, and it's responsible for attaching the virus to the host cell membrane.
We propose a novel method for predicting the host specificity of coronaviruses by analyzing spike protein sequences from different viral subgenera and species.
arXiv Detail & Related papers (2023-04-13T03:02:22Z) - Dense Feature Memory Augmented Transformers for COVID-19 Vaccination
Search Classification [60.49594822215981]
This paper presents a classification model for detecting COVID-19 vaccination related search queries.
We propose a novel approach of considering dense features as memory tokens that the model can attend to.
We show that this new modeling approach enables a significant improvement to the Vaccine Search Insights (VSI) task.
arXiv Detail & Related papers (2022-12-16T13:57:41Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Accurate Virus Identification with Interpretable Raman Signatures by
Machine Learning [12.184128048998906]
We present a machine learning approach for analyzing Raman spectra of human and avian viruses.
A Convolutional Neural Network (CNN) classifier specifically designed for spectral data achieves very high accuracy for a variety of virus type or subtype identification tasks.
arXiv Detail & Related papers (2022-06-05T22:31:14Z) - PWM2Vec: An Efficient Embedding Approach for Viral Host Specification
from Coronavirus Spike Sequences [0.7340017786387767]
We study the different hosts which can be potential carriers and transmitters of deadly viruses to humans.
In coronaviruses, the surface (S) protein, or spike protein, is an important part of determining host specificity.
We propose a feature embedding based on the well-known position-weight matrix (PWM), which we call PWM2Vec, and use it to generate feature vectors from the spike protein sequences of coronaviruses.
arXiv Detail & Related papers (2022-01-06T23:25:54Z) - Effective and scalable clustering of SARS-CoV-2 sequences [0.41998444721319206]
SARS-CoV-2 continues to mutate as it spreads, according to an evolutionary process.
The number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million.
We propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants.
arXiv Detail & Related papers (2021-08-18T13:32:43Z) - Towards Interpreting Zoonotic Potential of Betacoronavirus Sequences
With Attention [17.406451433347527]
We apply an attention-enhanced long-short-term memory (LSTM) deep neural net classifier to a highly conserved viral protein target to predict zoonotic potential across betacoronaviruses.
Analysis and visualization of attention at the sequence and structure-level features indicate possible association between important protein-protein interactions governing viral replication in zoonotic betacoronaviruses and zoonotic transmission.
arXiv Detail & Related papers (2021-08-18T10:11:11Z) - A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.