Robust Representation and Efficient Feature Selection Allows for
Effective Clustering of SARS-CoV-2 Variants
- URL: http://arxiv.org/abs/2110.09622v1
- Date: Mon, 18 Oct 2021 21:18:52 GMT
- Title: Robust Representation and Efficient Feature Selection Allows for
Effective Clustering of SARS-CoV-2 Variants
- Authors: Zahra Tayebi, Sarwan Ali, Murray Patterson
- Abstract summary: The SARS-CoV-2 virus contains different variants, each of them having different mutations.
Much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence.
We propose an approach to cluster spike protein sequences in order to study the behavior of different known variants.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The widespread availability of large amounts of genomic data on the
SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an
opportunity for researchers to analyze the disease at a level of detail unlike
any virus before it. On the one hand, this will help biologists, policy makers and
other authorities to make timely and appropriate decisions to control the
spread of the coronavirus. On the other hand, such studies will help to more
effectively deal with any possible future pandemic. Since the SARS-CoV-2 virus
contains different variants, each of them having different mutations,
performing any analysis on such data becomes a difficult task. It is well known
that much of the variation in the SARS-CoV-2 genome happens disproportionately
in the spike region of the genome sequence -- the relatively short region which
codes for the spike protein(s). Hence, in this paper, we propose an approach to
cluster spike protein sequences in order to study the behavior of different
known variants that are increasing at a very high rate throughout the world. We
use a k-mer-based approach to first generate a fixed-length feature vector
representation for the spike sequences. We then show that with the appropriate
feature selection, we can efficiently and effectively cluster the spike
sequences based on the different variants. Using a publicly available set of
SARS-CoV-2 spike sequences, we perform clustering of these sequences using both
hard and soft clustering methods and show that with our feature selection
methods, we can achieve higher F1 scores for the clusters.
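
The k-mer representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of k=3, the 20-letter amino acid alphabet, and the constant-column filter (a simple stand-in for the paper's feature selection methods, which are not specified here) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids. Real spike data may contain
# other symbols (e.g. "X"); k-mers containing them are ignored below.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(sequence, k=3, alphabet=ALPHABET):
    """Return a fixed-length list of k-mer counts for one sequence."""
    # Enumerate all len(alphabet)**k possible k-mers in a fixed order,
    # so every sequence maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts.get(kmer, 0) for kmer in vocabulary]


def drop_constant_features(vectors):
    """Hypothetical stand-in for feature selection: keep only the columns
    whose value differs between at least two sequences."""
    keep = [j for j in range(len(vectors[0]))
            if len({v[j] for v in vectors}) > 1]
    reduced = [[v[j] for j in keep] for v in vectors]
    return reduced, keep
```

With k=3 each sequence becomes a vector of 20^3 = 8000 counts, most of them zero, which is why some form of feature selection before clustering is attractive. The reduced vectors could then be passed to any hard or soft clustering routine.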
Related papers
- ViralVectors: Compact and Scalable Alignment-free Virome Feature
Generation [0.7874708385247353]
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than for any other virus.
We propose signatures, a compact vector generation from virome sequencing data that allows effective downstream analysis.
arXiv Detail & Related papers (2023-04-06T06:46:17Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- SelectAugment: Hierarchical Deterministic Sample Selection for Data
Augmentation [72.58308581812149]
We propose an effective approach, dubbed SelectAugment, to select samples to be augmented in a deterministic and online manner.
Specifically, in each batch, we first determine the augmentation ratio, and then decide whether to augment each training sample under this ratio.
In this way, the negative effects of the randomness in selecting samples to augment can be effectively alleviated and the effectiveness of DA is improved.
arXiv Detail & Related papers (2021-12-06T08:38:38Z)
- Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence [1.9573380763700707]
SARS-CoV-2 is an upper respiratory system RNA virus that had caused over 3 million deaths and infected over 150 million people worldwide as of May 2021.
We propose a neural network model that leverages recurrent and convolutional units to take in amino acid sequences of spike proteins and classify corresponding clades.
arXiv Detail & Related papers (2021-11-12T07:52:11Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19
Spike Sequences [0.0]
Several million genomic sequences are publicly available on platforms such as GISAID.
Spike2Vec is an efficient and scalable feature vector representation for each spike sequence.
arXiv Detail & Related papers (2021-09-12T03:16:27Z)
- Effective and scalable clustering of SARS-CoV-2 sequences [0.41998444721319206]
SARS-CoV-2 continues to mutate as it spreads, according to an evolutionary process.
The number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million.
We propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants.
arXiv Detail & Related papers (2021-08-18T13:32:43Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
- Early Detection of COVID-19 Hotspots Using Spatio-Temporal Data [66.70036251870988]
The Centers for Disease Control and Prevention (CDC) has worked with other federal agencies to identify counties with increasing coronavirus disease 2019 (COVID-19) incidence (hotspots).
This paper presents a sparse model for early detection of COVID-19 hotspots (at the county level) in the United States.
Deep neural networks are introduced to enhance the model's representative power while still enjoying the interpretability of the kernel.
arXiv Detail & Related papers (2021-05-31T19:28:17Z)
- Understanding the temporal evolution of COVID-19 research through
machine learning and natural language processing [66.63200823918429]
The outbreak of the novel coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has been continuously affecting human lives and communities around the world.
We used multiple data sources, i.e., PubMed and ArXiv, and built several machine learning models to characterize the landscape of current COVID-19 research.
Our findings confirm the types of research available in PubMed and ArXiv differ significantly, with the former exhibiting greater diversity in terms of COVID-19 related issues.
arXiv Detail & Related papers (2020-07-22T18:02:39Z)