ViralVectors: Compact and Scalable Alignment-free Virome Feature
Generation
- URL: http://arxiv.org/abs/2304.02891v2
- Date: Fri, 7 Apr 2023 11:58:23 GMT
- Title: ViralVectors: Compact and Scalable Alignment-free Virome Feature
Generation
- Authors: Sarwan Ali, Prakash Chourasia, Zahra Tayebi, Babatunde Bello, Murray
Patterson
- Abstract summary: The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than for any other virus.
We propose signatures, a compact vector generation from virome sequencing data that allows effective downstream analysis.
- Score: 0.7874708385247353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The amount of sequencing data for SARS-CoV-2 is several orders of magnitude
larger than for any other virus. This will continue to grow geometrically for SARS-CoV-2,
and other viruses, as many countries heavily finance genomic surveillance
efforts. Hence, we need methods for processing large amounts of sequence data
to allow for effective yet timely decision-making. Such data will come from
heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide
or amino acid sequencing reads pertaining to the whole genome or regions (e.g.,
spike) of interest. In this work, we propose ViralVectors, a compact
feature vector generation from virome sequencing data that allows effective
downstream analysis. Such generation is based on minimizers, a type of
lightweight "signature" of a sequence, used traditionally in assembly and read
mapping -- to our knowledge, the first use of minimizers in this way. We validate
our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike
sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show
robustness to more genomic variability); and (c) 4K raw WGS read sets taken
from nasal-swab PCR tests (to show the ability to process unassembled reads).
Our results show that ViralVectors outperforms current benchmarks in most
classification and clustering tasks.
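The minimizer idea described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: it assumes a minimizer is the lexicographically smallest m-mer inside each k-mer (forward direction only), and that the feature vector simply counts how often each possible m-mer occurs as a minimizer. The function names, the values of k and m, and the alphabet below are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def minimizers(seq, k, m):
    """Yield, for each k-mer of seq, its lexicographically smallest m-mer.

    Assumption: forward strand only; the paper's exact definition may differ.
    """
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer[j:j + m] for j in range(k - m + 1))


def minimizer_vector(seq, alphabet, k, m):
    """Return a fixed-length count vector with one entry per possible m-mer.

    The vector has len(alphabet) ** m entries, ordered lexicographically,
    so its length does not depend on the length of seq.
    """
    counts = Counter(minimizers(seq, k, m))
    return [counts["".join(p)] for p in product(sorted(alphabet), repeat=m)]


# Hypothetical toy input: a short nucleotide string with k=4 and m=2.
toy_minimizers = list(minimizers("ACGTAC", k=4, m=2))
toy_vector = minimizer_vector("ACGTAC", alphabet="ACGT", k=4, m=2)
```

Because the vector length depends only on the alphabet size and m, sequences of different lengths (or unaligned sequences) map to vectors of the same dimension, which is what makes such a representation usable as input to standard classifiers and clustering methods.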
Related papers
- Virus2Vec: Viral Sequence Classification Using Machine Learning [48.40285316053593]
We propose Virus2Vec, a feature-vector representation for viral sequences that enables machine learning models to identify viral hosts.
We empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and rabies virus sequence data to predict the host.
Our results demonstrate that Virus2Vec outperforms the predictive accuracies of baseline and state-of-the-art methods.
arXiv Detail & Related papers (2023-04-24T08:17:16Z)
- Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model [0.0]
SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans.
It can adapt to different hosts and evolve into different lineages.
It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein.
arXiv Detail & Related papers (2022-11-19T00:34:02Z)
- Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data [2.362412515574206]
We propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from raw sequencing reads without requiring assembly.
Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties compared to existing alignment-free baselines.
arXiv Detail & Related papers (2022-11-15T16:19:23Z)
- Efficient Approximate Kernel Based Spike Sequence Classification [56.2938724367661]
Machine learning models, such as SVM, require a definition of distance/similarity between pairs of sequences.
Exact methods yield better classification performance, but they pose high computational costs.
We propose a series of ways to improve the performance of the approximate kernel in order to enhance its predictive performance.
arXiv Detail & Related papers (2022-09-11T22:44:19Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants [0.0]
The SARS-CoV-2 virus contains different variants, each of them having different mutations.
Much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence.
We propose an approach to cluster spike protein sequences in order to study the behavior of different known variants.
arXiv Detail & Related papers (2021-10-18T21:18:52Z)
- Classifying COVID-19 Spike Sequences from Geographic Location Using Deep Learning [0.0]
We propose an algorithm that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using $k$-mers.
We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
arXiv Detail & Related papers (2021-10-02T14:09:30Z)
- Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences [0.0]
Several million genomic sequences are publicly available on platforms such as GISAID.
Spike2Vec is an efficient and scalable feature vector representation for each spike sequence.
arXiv Detail & Related papers (2021-09-12T03:16:27Z)
- StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids that play a key role in identifying variants, and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)
- Searching Central Difference Convolutional Networks for Face Anti-Spoofing [68.77468465774267]
Face anti-spoofing (FAS) plays a vital role in face recognition systems.
Most state-of-the-art FAS methods rely on stacked convolutions and expert-designed networks.
Here we propose a novel frame-level FAS method based on Central Difference Convolution (CDC).
arXiv Detail & Related papers (2020-03-09T12:48:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.