Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads
Data
- URL: http://arxiv.org/abs/2211.08267v1
- Date: Tue, 15 Nov 2022 16:19:23 GMT
- Title: Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads
Data
- Authors: Prakash Chourasia, Sarwan Ali, Simone Ciccolella, Gianluca Della
Vedova, Murray Patterson
- Abstract summary: We propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from raw sequencing reads without requiring assembly.
Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties than existing alignment-free baselines.
- Score: 2.362412515574206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The massive amount of genomic data appearing for SARS-CoV-2 since the
beginning of the COVID-19 pandemic has challenged traditional methods for
studying its dynamics. As a result, new methods such as Pangolin, which can
scale to the millions of samples of SARS-CoV-2 currently available, have
appeared. Such a tool is tailored to take as input assembled, aligned and
curated full-length sequences, such as those found in the GISAID database. As
high-throughput sequencing technologies continue to advance, such assembly,
alignment and curation may become a bottleneck, creating a need for methods
which can process raw sequencing reads directly.
In this paper, we propose Reads2Vec, an alignment-free embedding approach
that can generate a fixed-length feature vector representation directly from
the raw sequencing reads without requiring assembly. Furthermore, since such an
embedding is a numerical representation, it may be applied to highly optimized
classification and clustering algorithms. Experiments on simulated data show
that our proposed embedding obtains better classification results and better
clustering properties than existing alignment-free baselines. In a study
on real data, we show that alignment-free embeddings have better clustering
properties than the Pangolin tool and that the spike region of the SARS-CoV-2
genome heavily informs the alignment-free clusterings, which is consistent with
current biological knowledge of SARS-CoV-2.
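To make the idea of a fixed-length, alignment-free representation concrete, the following minimal sketch maps a set of raw reads to a single k-mer frequency vector. It is not the Reads2Vec algorithm, which the abstract does not specify; the function name `kmer_embedding`, the choice of k, and the example reads are all hypothetical and chosen only for illustration.

```python
from itertools import product


def kmer_embedding(reads, k=3, alphabet="ACGT"):
    """Map a collection of reads to one fixed-length k-mer frequency vector.

    Illustrative only: a generic k-mer spectrum, not the Reads2Vec method.
    """
    # One slot per possible k-mer, so the vector length is len(alphabet) ** k
    # regardless of how many reads there are or how long they are.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            slot = index.get(read[i:i + k])
            if slot is not None:  # skip k-mers with ambiguous bases such as N
                counts[slot] += 1
    total = sum(counts)
    # Normalize to frequencies so samples with different read depth are comparable.
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical toy reads; no assembly or alignment step is needed.
reads = ["ACGTAC", "GTACGN"]
vec = kmer_embedding(reads, k=2)
print(len(vec))
```

Because the output is a plain numeric vector of constant length, it can be passed directly to standard classification or clustering routines, which is the property the abstract emphasizes.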
Related papers
- GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Calibration Clustering (GCC) method to incorporate feature learning and augmentation into the clustering procedure.
First, we develop a discriminative feature alignment mechanism to discover intrinsic relationship across real and generated samples.
Second, we design a self-supervised metric learning to generate more reliable cluster assignment.
arXiv Detail & Related papers (2024-04-14T01:51:11Z) - ViralVectors: Compact and Scalable Alignment-free Virome Feature
Generation [0.7874708385247353]
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than that for any other virus.
We propose signatures, a compact vector generation from virome sequencing data that allows effective downstream analysis.
arXiv Detail & Related papers (2023-04-06T06:46:17Z) - Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network
Model [0.0]
SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans.
It can adapt to different hosts and evolve into different lineages.
It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein.
arXiv Detail & Related papers (2022-11-19T00:34:02Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that, for specific embedding methods, some simulation-based approaches are more robust (and accurate) than others to certain adversarial attacks on the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Clue Me In: Semi-Supervised FGVC with Out-of-Distribution Data [44.90231337626545]
We propose a novel design specifically aimed at making out-of-distribution data work for semi-supervised visual classification.
Our experimental results reveal that (i) the proposed method yields good robustness against out-of-distribution data, and (ii) it can be equipped with prior arts, boosting their performance.
arXiv Detail & Related papers (2021-12-06T07:22:10Z) - Robust Representation and Efficient Feature Selection Allows for
Effective Clustering of SARS-CoV-2 Variants [0.0]
The SARS-CoV-2 virus contains different variants, each of them having different mutations.
Much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence.
We propose an approach to cluster spike protein sequences in order to study the behavior of different known variants.
arXiv Detail & Related papers (2021-10-18T21:18:52Z) - Variational Auto Encoder Gradient Clustering [0.0]
Clustering using deep neural network models has been extensively studied in recent years.
This article investigates how probability function gradient ascent can be used to process data in order to achieve better clustering.
We propose a simple yet effective method for investigating a suitable number of clusters for data, based on the DBSCAN clustering algorithm.
arXiv Detail & Related papers (2021-05-11T08:00:36Z) - Binary Classification from Multiple Unlabeled Datasets via Surrogate Set
Classification [94.55805516167369]
We propose a new approach for binary classification from $m$ U-sets for $m \ge 2$.
Our key idea is to consider an auxiliary classification task called surrogate set classification (SSC).
arXiv Detail & Related papers (2021-02-01T07:36:38Z) - Improving Generative Adversarial Networks with Local Coordinate Coding [150.24880482480455]
Generative adversarial networks (GANs) have shown remarkable success in generating realistic data from some predefined prior distribution.
In practice, semantic information might be represented by some latent distribution learned from data.
We propose an LCCGAN model with local coordinate coding (LCC) to improve the performance of generating data.
arXiv Detail & Related papers (2020-07-28T09:17:50Z) - LSD-C: Linearly Separable Deep Clusters [145.89790963544314]
We present LSD-C, a novel method to identify clusters in an unlabeled dataset.
Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation.
We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
arXiv Detail & Related papers (2020-06-17T17:58:10Z) - A Novel Granular-Based Bi-Clustering Method of Deep Mining the
Co-Expressed Genes [76.84066556597342]
Bi-clustering methods are used to mine bi-clusters whose subsets of samples (genes) are co-regulated under their test conditions.
Unfortunately, traditional bi-clustering methods are not fully effective in discovering such bi-clusters.
We propose a novel bi-clustering method by involving here the theory of Granular Computing.
arXiv Detail & Related papers (2020-05-12T02:04:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.