Related papers: Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences

URL: http://arxiv.org/abs/2109.05019v1
Date: Sun, 12 Sep 2021 03:16:27 GMT
Title: Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences
Authors: Sarwan Ali; Murray Patterson
Abstract summary: Several million genomic sequences are publicly available on platforms such as GISAID. Spike2Vec is an efficient and scalable feature vector representation for each spike sequence.
Score: 0.0
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: With the rapid global spread of COVID-19, more and more data related to this virus is becoming available, including genomic sequence data. The total number of genomic sequences that are publicly available on platforms such as GISAID is currently several million, and is increasing with every day. The availability of such "Big Data" creates a new opportunity for researchers to study this virus in detail. This is particularly important with all of the dynamics of the COVID-19 variants which emerge and circulate. This rich data source will give us insights on the best ways to perform genomic surveillance for this and future pandemic threats, with the ultimate goal of mitigating or eliminating such threats. Analyzing and processing the several million genomic sequences is a challenging task. Although traditional methods for sequence classification are proven to be effective, they are not designed to deal with these specific types of genomic sequences. Moreover, most of the existing methods also face the issue of scalability. Previous studies which were tailored to coronavirus genomic data proposed to use spike sequences (corresponding to a subsequence of the genome), rather than using the complete genomic sequence, to perform different machine learning (ML) tasks such as classification and clustering. However, those methods suffer from scalability issues. In this paper, we propose an approach called Spike2Vec, an efficient and scalable feature vector representation for each spike sequence that can be used for downstream ML tasks. Through experiments, we show that Spike2Vec is not only scalable on several million spike sequences, but also outperforms the baseline models in terms of prediction accuracy, F1-score, etc.

Related papers

ViralVectors: Compact and Scalable Alignment-free Virome Feature Generation [0.7874708385247353]
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than any virus. We propose signatures, a compact vector generation from virome sequencing data that allows effective downstream analysis.
arXiv Detail & Related papers (2023-04-06T06:46:17Z)
Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model [0.0]
SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans. It can adapt to different hosts and evolve into different lineages. It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein.
arXiv Detail & Related papers (2022-11-19T00:34:02Z)
Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences. Exact methods yield better classification performance, but they pose high computational costs. We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations. We propose a method, based on metric learning, that ranks the most likely labs-of-origin. We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants [0.0]
The SARS-CoV-2 virus contains different variants, each of them having different mutations. Much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence. We propose an approach to cluster spike protein sequences in order to study the behavior of different known variants.
arXiv Detail & Related papers (2021-10-18T21:18:52Z)
Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT. We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
Classifying COVID-19 Spike Sequences from Geographic Location Using Deep Learning [0.0]
We propose an algorithm that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using $k$-mers. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
arXiv Detail & Related papers (2021-10-02T14:09:30Z)
Effective and scalable clustering of SARS-CoV-2 sequences [0.41998444721319206]
SARS-CoV-2 continues to mutate as it spreads, according to an evolutionary process. The number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. We propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants.
arXiv Detail & Related papers (2021-08-18T13:32:43Z)
A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance. We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately. Our approach should be preferred if the goal is to select as many relevant predictors as possible. Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.