Effective and scalable clustering of SARS-CoV-2 sequences
- URL: http://arxiv.org/abs/2108.08143v1
- Date: Wed, 18 Aug 2021 13:32:43 GMT
- Title: Effective and scalable clustering of SARS-CoV-2 sequences
- Authors: Sarwan Ali, Tamkanat-E-Ali, Muhammad Asad Khan, Imdadullah Khan,
Murray Patterson
- Abstract summary: SARS-CoV-2 continues to mutate as it spreads, according to an evolutionary process.
The number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million.
We propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants.
- Score: 0.41998444721319206
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: SARS-CoV-2, like any other virus, continues to mutate as it spreads,
according to an evolutionary process. Unlike any other virus, the number of
currently available sequences of SARS-CoV-2 in public databases such as GISAID
is already several million. This amount of data has the potential to uncover
the evolutionary dynamics of a virus like never before. However, a million is
already several orders of magnitude beyond what can be processed by the
traditional methods designed to reconstruct a virus's evolutionary history,
such as those that build a phylogenetic tree. Hence, new and scalable methods
will need to be devised in order to make use of the ever increasing number of
viral sequences being collected.
Since identifying variants is an important part of understanding the
evolution of a virus, in this paper, we propose an approach based on clustering
sequences to identify the current major SARS-CoV-2 variants. Using a $k$-mer
based feature vector generation and efficient feature selection methods, our
approach is effective in identifying variants, as well as being efficient and
scalable to millions of sequences. Such a clustering method allows us to show
the relative proportion of each variant over time, giving the rate of spread of
each variant in different locations -- something which is important for vaccine
development and distribution. We also compute the importance of each amino acid
position of the spike protein in identifying a given variant in terms of
information gain. Positions of high variant-specific importance tend to agree
with those reported by the USA's Centers for Disease Control and Prevention
(CDC), further demonstrating our approach.
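
The two computational ideas named in the abstract, $k$-mer based feature vectors and per-position information gain, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 20-letter amino acid alphabet, the choice of $k$, and the toy sequences and variant labels are all assumptions made up for illustration, and the sequences are assumed to be aligned to equal length. In a full pipeline the resulting vectors would be passed to a feature selection step and a clustering algorithm, which are omitted here.

```python
from collections import Counter
from itertools import product
from math import log2

# Assumed alphabet: the 20 standard amino acids (not specified in the abstract).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Fixed-length vector with one count per possible k-mer, in lexicographic
    order of the alphabet. Its length is len(alphabet) ** k."""
    counts = kmer_counts(sequence, k)
    return [counts.get("".join(p), 0) for p in product(alphabet, repeat=k)]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(sequences, labels, position):
    """Reduction in label entropy obtained by knowing the amino acid found at
    one aligned position. Higher values mark positions that separate variants."""
    groups = {}
    for seq, label in zip(sequences, labels):
        groups.setdefault(seq[position], []).append(label)
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy, made-up data: four short aligned fragments with hypothetical labels.
toy_sequences = ["MFVFL", "MFVFL", "MKVFL", "MKVFL"]
toy_labels = ["A", "A", "B", "B"]

vector = kmer_vector(toy_sequences[0], k=2)
gains = [information_gain(toy_sequences, toy_labels, i) for i in range(5)]
```

In this toy example only the second position differs between the two hypothetical variants, so it is the only position with non-zero information gain. The vector length grows as the alphabet size raised to the power $k$, which is one reason a feature selection step matters at scale.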
Related papers
- Virus2Vec: Viral Sequence Classification Using Machine Learning [48.40285316053593]
We propose Virus2Vec, a feature-vector representation for viral sequences that enables machine learning models to identify viral hosts.
We empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and rabies virus sequence data to predict the host.
Our results demonstrate that Virus2Vec outperforms the predictive accuracies of baseline and state-of-the-art methods.
arXiv Detail & Related papers (2023-04-24T08:17:16Z)
- Efficient Classification of SARS-CoV-2 Spike Sequences Using Federated Learning [4.497217246897902]
We analyze SARS-CoV-2 spike sequences in a distributed way, without data sharing.
We achieve an overall accuracy of $93\%$ on the coronavirus variant identification task.
We plan to use this proof-of-concept to implement a privacy-preserving pandemic response strategy.
arXiv Detail & Related papers (2023-02-17T04:41:39Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of improvements to the approximate kernel that enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Using Deep Learning Sequence Models to Identify SARS-CoV-2 Divergence [1.9573380763700707]
SARS-CoV-2 is an upper respiratory system RNA virus that has caused over 3 million deaths and infected over 150 million people worldwide as of May 2021.
We propose a neural network model that leverages recurrent and convolutional units to take in amino acid sequences of spike proteins and classify corresponding clades.
arXiv Detail & Related papers (2021-11-12T07:52:11Z)
- Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants [0.0]
The SARS-CoV-2 virus contains different variants, each of them having different mutations.
Variation in the SARS-CoV-2 genome occurs disproportionately in the spike region of the genome sequence.
We propose an approach to cluster spike protein sequences in order to study the behavior of different known variants.
arXiv Detail & Related papers (2021-10-18T21:18:52Z)
- Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences [0.0]
Several million genomic sequences are publicly available on platforms such as GISAID.
Spike2Vec is an efficient and scalable feature vector representation for each spike sequence.
arXiv Detail & Related papers (2021-09-12T03:16:27Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids that play a key role in identifying variants, and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
- Epigenetic evolution of deep convolutional models [81.21462458089142]
We build upon a previously proposed neuroevolution framework to evolve deep convolutional models.
We propose a convolutional layer layout which allows kernels of different shapes and sizes to coexist within the same layer.
The proposed layout enables the size and shape of individual kernels within a convolutional layer to be evolved with a corresponding new mutation operator.
arXiv Detail & Related papers (2021-04-12T12:45:16Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)