Classifying COVID-19 Spike Sequences from Geographic Location Using Deep
Learning
- URL: http://arxiv.org/abs/2110.00809v1
- Date: Sat, 2 Oct 2021 14:09:30 GMT
- Title: Classifying COVID-19 Spike Sequences from Geographic Location Using Deep
Learning
- Authors: Sarwan Ali, Babatunde Bello, Murray Patterson
- Abstract summary: We propose an algorithm that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using $k$-mers.
We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: With the rapid spread of COVID-19 worldwide, viral genomic data is available
in the order of millions of sequences on public databases such as GISAID. This
Big Data creates a unique opportunity for analysis towards the research
of effective vaccine development for current pandemics, and avoiding or
mitigating future pandemics. One piece of information that comes with every
such viral sequence is the geographical location where it was collected -- the
patterns found between viral variants and geographic location surely being an
important part of this analysis. One major challenge that researchers face is
processing such huge, highly dimensional data to get useful insights as quickly
as possible. Most of the existing methods face scalability issues when dealing
with the magnitude of such data. In this paper, we propose an algorithm that
first computes a numerical representation of the spike protein sequence of
SARS-CoV-2 using $k$-mers (substrings) and then uses a deep learning-based model
to classify the sequences in terms of geographical location. We show that our
proposed model significantly outperforms the baselines. We also show the
importance of different amino acids in the spike sequences by computing the
information gain corresponding to the true class labels.
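The two steps named in the abstract, a $k$-mer based numerical representation and an information gain score against the class labels, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the value k=3, the 20-letter amino acid alphabet, and the function names `kmer_vector`, `entropy`, and `information_gain` are all assumptions made for the example, and the deep learning classifier itself is not shown because the abstract does not describe its architecture.

```python
from collections import Counter
from itertools import product
from math import log2

# Assumption: the 20 standard amino acids. The alphabet actually used
# in the paper is not stated in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_vector(sequence, k=3, alphabet=AMINO_ACIDS):
    """Return a fixed-length list of k-mer counts for one sequence.

    The vector has len(alphabet) ** k entries, one per possible k-mer.
    The index is rebuilt on every call for simplicity; cache it when
    processing many sequences.
    """
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    # Slide a window of length k over the sequence, one position at a time.
    for start in range(len(sequence) - k + 1):
        pos = index.get(sequence[start:start + k])
        if pos is not None:  # skip k-mers with characters outside the alphabet
            vec[pos] += 1
    return vec


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG = H(labels) - sum over v of P(v) * H(labels | feature = v)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

For example, `kmer_vector("MFVFLVLLPLVSSQ", k=3)` yields a vector of length 20**3 whose entries sum to the number of sliding windows in the sequence. `information_gain` can be applied to the amino acid observed at one sequence position across all samples, with the geographic location as the label; a position whose amino acid fully determines the location scores as high as the label entropy, and an uninformative position scores zero.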
Related papers
- Domain Adaptive Synapse Detection with Weak Point Annotations [63.97144211520869]
We present AdaSyn, a framework for domain adaptive synapse detection with weak point annotations.
In the WASPSYN challenge at ISBI 2023, our method ranks 1st.
arXiv Detail & Related papers (2023-08-31T05:05:53Z)
- Spatial Implicit Neural Representations for Global-Scale Species Mapping [72.92028508757281]
Given a set of locations where a species has been observed, the goal is to build a model to predict whether the species is present or absent at any location.
Traditional methods struggle to take advantage of emerging large-scale crowdsourced datasets.
We use Spatial Implicit Neural Representations (SINRs) to jointly estimate the geographical range of 47k species simultaneously.
arXiv Detail & Related papers (2023-06-05T03:36:01Z)
- ViralVectors: Compact and Scalable Alignment-free Virome Feature Generation [0.7874708385247353]
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus.
We propose signatures, a compact vector generation from virome sequencing data that allows effective downstream analysis.
arXiv Detail & Related papers (2023-04-06T06:46:17Z)
- Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model [0.0]
SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans.
It can adapt to different hosts and evolve into different lineages.
It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein.
arXiv Detail & Related papers (2022-11-19T00:34:02Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Efficient Analysis of COVID-19 Clinical Data using Machine Learning Models [0.0]
Huge volumes of data and case studies have been made available, providing researchers with a unique opportunity to find trends.
Applying machine learning based algorithms to this big data is a natural approach to take to this aim.
We show that with the efficient feature selection algorithm, we can achieve a prediction accuracy of more than 90% in most cases.
arXiv Detail & Related papers (2021-10-18T20:06:01Z)
- Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences [0.0]
Several million genomic sequences are publicly available on platforms such as GISAID.
Spike2Vec is an efficient and scalable feature vector representation for each spike sequence.
arXiv Detail & Related papers (2021-09-12T03:16:27Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
- Early Detection of COVID-19 Hotspots Using Spatio-Temporal Data [66.70036251870988]
The Centers for Disease Control and Prevention (CDC) has worked with other federal agencies to identify counties with increasing coronavirus disease 2019 (COVID-19) incidence (hotspots).
This paper presents a sparse model for early detection of COVID-19 hotspots (at the county level) in the United States.
Deep neural networks are introduced to enhance the model's representative power while still enjoying the interpretability of the kernel.
arXiv Detail & Related papers (2021-05-31T19:28:17Z)
- Deep learning for time series classification [2.0305676256390934]
Time series analysis allows us to visualize and understand the evolution of a process over time.
Time series classification consists of constructing algorithms dedicated to automatically label time series data.
Deep learning has emerged as one of the most effective methods for tackling the supervised classification task.
arXiv Detail & Related papers (2020-10-01T17:38:40Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.