Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19
Spike Sequences
- URL: http://arxiv.org/abs/2109.05019v1
- Date: Sun, 12 Sep 2021 03:16:27 GMT
- Title: Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19
Spike Sequences
- Authors: Sarwan Ali; Murray Patterson
- Abstract summary: Several million genomic sequences are publicly available on platforms such as GISAID.
Spike2Vec is an efficient and scalable feature vector representation for each spike sequence.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: With the rapid global spread of COVID-19, more and more data related to this
virus is becoming available, including genomic sequence data. The total number
of genomic sequences that are publicly available on platforms such as GISAID is
currently several million, and is increasing with every day. The availability
of such Big Data creates a new opportunity for researchers to study
this virus in detail. This is particularly important with all of the dynamics
of the COVID-19 variants which emerge and circulate. This rich data source will
give us insights on the best ways to perform genomic surveillance for this and
future pandemic threats, with the ultimate goal of mitigating or eliminating
such threats. Analyzing and processing the several million genomic sequences is
a challenging task. Although traditional methods for sequence classification
are proven to be effective, they are not designed to deal with these specific
types of genomic sequences. Moreover, most of the existing methods also face
the issue of scalability. Previous studies which were tailored to coronavirus
genomic data proposed to use spike sequences (corresponding to a subsequence of
the genome), rather than using the complete genomic sequence, to perform
different machine learning (ML) tasks such as classification and clustering.
However, those methods suffer from scalability issues. In this paper, we
propose an approach called Spike2Vec, an efficient and scalable feature vector
representation for each spike sequence that can be used for downstream ML
tasks. Through experiments, we show that Spike2Vec is not only scalable on
several million spike sequences, but also outperforms the baseline models in
terms of prediction accuracy, F1-score, etc.
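The abstract does not say how the feature vector is built, so the following is only a hedged illustration of what a fixed-length representation of an amino-acid sequence can look like. It uses plain k-mer counts, a technique named in several of the related papers listed on this page; it is not presented as the Spike2Vec method itself. The function name `kmer_count_vector`, the choice `k=3`, and the toy fragment are assumptions made for the example.

```python
from itertools import product

# The 20 standard amino acids. Real spike sequences may also contain
# ambiguity codes such as "X"; this sketch simply skips those windows.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_count_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return a fixed-length list of k-mer counts for one sequence.

    Hypothetical helper for illustration; the vector has
    len(alphabet) ** k entries regardless of the sequence length.
    """
    # Map every possible k-mer over the alphabet to a vector position.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    # Slide a window of length k over the sequence, one residue at a time.
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:  # skip windows with non-standard symbols
            vector[position] += 1
    return vector


toy = "MFVFLVLLPLVSSQ"  # short illustrative fragment, not a full spike sequence
vec = kmer_count_vector(toy, k=3)
print(len(vec), sum(vec))
```

Because every sequence maps to a vector of the same length, such vectors can be passed directly to standard classifiers or clustering algorithms. The vector length grows as `len(alphabet) ** k`, which is one reason scalability becomes a concern for larger `k`.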
Related papers
- Fast and Functional Structured Data Generators Rooted in
Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z)
- ViralVectors: Compact and Scalable Alignment-free Virome Feature
Generation [0.7874708385247353]
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus.
We propose signatures, a compact vector generation from virome sequencing data that allows effective downstream analysis.
arXiv Detail & Related papers (2023-04-06T06:46:17Z)
- Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network
Model [0.0]
SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans.
It can adapt to different hosts and evolve into different lineages.
It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein.
arXiv Detail & Related papers (2022-11-19T00:34:02Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Deep metric learning improves lab of origin prediction of genetically
engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
- Robust Representation and Efficient Feature Selection Allows for
Effective Clustering of SARS-CoV-2 Variants [0.0]
The SARS-CoV-2 virus contains different variants, each of them having different mutations.
Much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence.
We propose an approach to cluster spike protein sequences in order to study the behavior of different known variants.
arXiv Detail & Related papers (2021-10-18T21:18:52Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- Classifying COVID-19 Spike Sequences from Geographic Location Using Deep
Learning [0.0]
We propose an algorithm that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using $k$-mers.
We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
arXiv Detail & Related papers (2021-10-02T14:09:30Z)
- Effective and scalable clustering of SARS-CoV-2 sequences [0.41998444721319206]
SARS-CoV-2 continues to mutate as it spreads, according to an evolutionary process.
The number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million.
We propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants.
arXiv Detail & Related papers (2021-08-18T13:32:43Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
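One of the summaries above mentions ranking amino acids by the information gain they carry about the true class labels. As a rough sketch of that calculation, assuming aligned sequences so that each position forms a column of residues, the gain for one column is IG = H(labels) - H(labels | column). The function names and the toy labels below are hypothetical and do not come from any of the listed papers.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(column, labels):
    """IG = H(labels) - H(labels | column) for one aligned position."""
    total = len(labels)
    # Group the class labels by the residue observed at this position.
    groups = {}
    for residue, label in zip(column, labels):
        groups.setdefault(residue, []).append(label)
    # Weighted average of the label entropy within each residue group.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: four sequences, two made-up class labels.
labels = ["variant_a", "variant_a", "variant_b", "variant_b"]
informative = ["Y", "Y", "N", "N"]    # residue fully determines the label
uninformative = ["A", "A", "A", "A"]  # residue is constant across classes

ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)
print(ig_informative, ig_uninformative)
```

A position whose residue separates the classes perfectly receives the full label entropy as its gain, while a conserved position receives zero, which is what makes the score usable for ranking positions.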
This list is automatically generated from the titles and abstracts of the papers on this site.