Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network
Model
- URL: http://arxiv.org/abs/2211.10546v2
- Date: Tue, 22 Nov 2022 07:56:35 GMT
- Title: Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network
Model
- Authors: Sarwan Ali
- Abstract summary: The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans.
It can adapt to different hosts and evolve into different lineages.
It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans.
Like many coronaviruses, it can adapt to different hosts and evolve into
different lineages. It is well-known that the major SARS-CoV-2 lineages are
characterized by mutations that happen predominantly in the spike protein.
Understanding the spike protein structure and how it can be perturbed is vital
for understanding and determining if a lineage is of concern. These are crucial
to identifying and controlling current outbreaks and preventing future
pandemics. Machine learning (ML) methods are a viable solution to this effort,
given the volume of available sequencing data, much of which is unaligned or
even unassembled. However, such ML methods require fixed-length numerical
feature vectors in Euclidean space to be applicable. Moreover, Euclidean space
is not considered the best choice for classification and clustering of
biological sequences. To address this, we design a method that converts protein
(spike) sequences into a sequence similarity network (SSN). The SSN can then
serve as input to classical graph-mining algorithms for typical tasks such as
classification and clustering, to better understand the data. We show that the
proposed alignment-free method outperforms the current state-of-the-art (SOTA)
method in terms of clustering results. We also achieve higher classification
accuracy using the well-known Node2Vec-based embedding than with other baseline
embedding approaches.
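The abstract does not say how the sequence similarity network is built, so the following is only a minimal sketch of one plausible nearest-neighbors construction, not the authors' method. It assumes each sequence is summarized by overlapping k-mer counts and that neighbors are chosen by Euclidean distance between those count vectors. The function names, the parameters `k` and `n_neighbors`, and the toy sequences are all hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence (assumed representation)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def euclidean(a, b):
    """Euclidean distance between two sparse k-mer count vectors."""
    keys = set(a) | set(b)
    return sqrt(sum((a[x] - b[x]) ** 2 for x in keys))


def knn_network(seqs, k=3, n_neighbors=2):
    """Link every sequence to its n_neighbors closest sequences.

    Returns a set of undirected edges (i, j) with i < j, where i and j
    are positions in seqs.
    """
    profiles = [kmer_counts(s, k) for s in seqs]
    edges = set()
    for i, p in enumerate(profiles):
        # Sort all other sequences by distance to sequence i.
        dists = sorted(
            (euclidean(p, q), j) for j, q in enumerate(profiles) if j != i
        )
        for _, j in dists[:n_neighbors]:
            edges.add((min(i, j), max(i, j)))
    return edges


# Made-up toy sequences: two near-identical pairs, not real spike data.
toy = [
    "MFVFLVLLPLVSSQCVNLTT",
    "MFVFLVLLPLVSSQCVNFTT",
    "AAAAAAGGGGGGAAAAAAGG",
    "AAAAAAGGGGGGAAAAAGGG",
]
edges = knn_network(toy, k=3, n_neighbors=1)
```

The resulting edge set could be loaded into a graph library and passed to a node-embedding method such as Node2Vec, or to a graph clustering algorithm, which is the kind of downstream use the abstract describes.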
Related papers
- A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2023-11-29T11:51:05Z)
- Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments [4.34040512215583]
A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed.
The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences.
arXiv Detail & Related papers (2023-04-06T06:46:17Z)
- ViralVectors: Compact and Scalable Alignment-free Virome Feature Generation [0.7874708385247353]
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than for any other virus.
We propose signatures, a compact vector generation from virome sequencing data that allows effective downstream analysis.
arXiv Detail & Related papers (2023-02-08T16:36:40Z)
- DynGFN: Towards Bayesian Inference of Gene Regulatory Networks with GFlowNets [81.75973217676986]
Gene regulatory networks (GRN) describe interactions between genes and their products that control gene expression and cellular function.
Existing methods either focus on challenge (1), identifying cyclic structure from dynamics, or on challenge (2) learning complex Bayesian posteriors over DAGs, but not both.
In this paper we leverage the fact that it is possible to estimate the "velocity" of gene expression with RNA velocity techniques to develop an approach that addresses both challenges.
arXiv Detail & Related papers (2022-11-15T16:19:23Z)
- Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data [2.362412515574206]
We propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from raw sequencing reads without requiring assembly.
Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties compared to existing alignment-free baselines.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others against certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2021-09-12T03:16:27Z)
- Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences [0.0]
Several million genomic sequences are publicly available on platforms such as GISAID.
Spike2Vec is an efficient and scalable feature vector representation for each spike sequence.
arXiv Detail & Related papers (2020-12-06T18:55:07Z)
- An Uncertainty-Driven GCN Refinement Strategy for Organ Segmentation [53.425900196763756]
We propose a segmentation refinement method based on uncertainty analysis and graph convolutional networks.
We employ the uncertainty levels of the convolutional network in a particular input volume to formulate a semi-supervised graph learning problem.
We show that our method outperforms the state-of-the-art CRF refinement method by improving the Dice score by 1% for the pancreas and 2% for the spleen.
arXiv Detail & Related papers (2020-03-27T05:04:24Z)
- Towards Discriminability and Diversity: Batch Nuclear-norm Maximization under Label Insufficient Situations [154.51144248210338]
Batch Nuclear-norm Maximization (BNM) is proposed to boost learning in label-insufficient scenarios.
BNM outperforms competitors and works well with existing well-known methods.
arXiv Detail & Related papers (2020-03-27T05:04:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.