Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences
- URL: http://arxiv.org/abs/2512.10147v1
- Date: Wed, 10 Dec 2025 23:03:10 GMT
- Title: Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences
- Authors: Sarwan Ali, Taslim Murad,
- Abstract summary: Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning.<n>Existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets.<n>In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences.
- Score: 4.970277730082774
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4\% classification accuracy while reducing embedding generation time by as much as 99.81\%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
Related papers
- Neuromorphic Spiking Neural Network Based Classification of COVID-19 Spike Sequences [4.497217246897902]
We propose a neural network-based (NN) mechanism to perform an efficient analysis of the SARS-CoV-2 data.<n>In this paper, we introduce a pipeline that first converts the spike protein sequences into a fixed-length numerical representation and then uses Neuromorphic Spiking Neural Network to classify those sequences.
arXiv Detail & Related papers (2024-12-19T10:26:31Z) - Large-Scale Targeted Cause Discovery via Learning from Simulated Data [66.51307552703685]
We propose a novel machine learning approach for inferring causal variables of a target variable from observations.<n>We train a neural network using supervised learning on simulated data to infer causality.<n> Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks.
arXiv Detail & Related papers (2024-08-29T02:21:11Z) - Optimized Learning for X-Ray Image Classification for Multi-Class Disease Diagnoses with Accelerated Computing Strategies [0.0]
False positives introduce the risk of erroneously identifying non-existent conditions, leading to misdiagnosis and a decline in patient care quality.
This study introduces modified pre-trained ResNet models tailored for multi-class disease diagnosis of X-ray images.
We demonstrate substantial improvements in execution runtime between normal training and inference-accelerated training.
arXiv Detail & Related papers (2024-07-01T18:31:30Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads
Data [2.362412515574206]
We propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from raw sequencing reads without requiring assembly.
Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties contrary to existing alignment-free baselines.
arXiv Detail & Related papers (2022-11-15T16:19:23Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [73.85961005970222]
We propose a new distributed dynamic safe screening (DDSS) method for sparsity regularized models and apply it on shared-memory and distributed-memory architecture respectively.
We prove that the proposed method achieves the linear convergence rate with lower overall complexity and can eliminate almost all the inactive features in a finite number of iterations almost surely.
arXiv Detail & Related papers (2022-04-23T02:45:55Z) - Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19
Spike Sequences [0.0]
Several million genomic sequences are publicly available on platforms such as GISAID.
Spike2Vec is an efficient and scalable feature vector representation for each spike sequence.
arXiv Detail & Related papers (2021-09-12T03:16:27Z) - Many-to-One Distribution Learning and K-Nearest Neighbor Smoothing for
Thoracic Disease Identification [83.6017225363714]
deep learning has become the most powerful computer-aided diagnosis technology for improving disease identification performance.
For chest X-ray imaging, annotating large-scale data requires professional domain knowledge and is time-consuming.
In this paper, we propose many-to-one distribution learning (MODL) and K-nearest neighbor smoothing (KNNS) methods to improve a single model's disease identification performance.
arXiv Detail & Related papers (2021-02-26T02:29:30Z) - STELAR: Spatio-temporal Tensor Factorization with Latent Epidemiological
Regularization [76.57716281104938]
We develop a tensor method to predict the evolution of epidemic trends for many regions simultaneously.
STELAR enables long-term prediction by incorporating latent temporal regularization through a system of discrete-time difference equations.
We conduct experiments using both county- and state-level COVID-19 data and show that our model can identify interesting latent patterns of the epidemic.
arXiv Detail & Related papers (2020-12-08T21:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.