Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing
- URL: http://arxiv.org/abs/2512.03158v1
- Date: Tue, 02 Dec 2025 19:04:05 GMT
- Title: Contrastive Deep Learning for Variant Detection in Wastewater Genomic Sequencing
- Authors: Adele Chinda, Richmond Azumah, Hemanth Demakethepalli Venkateswara,
- Abstract summary: We present a comprehensive framework for unsupervised viral variant detection using Vector-Quantized Variational Autoencoders (VQ-VAE). VQ-VAE learns discrete codebooks of genomic patterns from k-mer tokenized sequences without requiring reference genomes or variant labels. Our framework provides a scalable, interpretable approach to genomic surveillance with direct applications to public health monitoring.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wastewater-based genomic surveillance has emerged as a powerful tool for population-level viral monitoring, offering comprehensive insights into circulating viral variants across entire communities. However, this approach faces significant computational challenges stemming from high sequencing noise, low viral coverage, fragmented reads, and the complete absence of labeled variant annotations. Traditional reference-based variant calling pipelines struggle with novel mutations and require extensive computational resources. We present a comprehensive framework for unsupervised viral variant detection using Vector-Quantized Variational Autoencoders (VQ-VAE) that learns discrete codebooks of genomic patterns from k-mer tokenized sequences without requiring reference genomes or variant labels. Our approach extends the base VQ-VAE architecture with masked reconstruction pretraining for robustness to missing data and contrastive learning for highly discriminative embeddings. Evaluated on SARS-CoV-2 wastewater sequencing data comprising approximately 100,000 reads, our VQ-VAE achieves 99.52% mean token-level accuracy and 56.33% exact sequence match rate while maintaining 19.73% codebook utilization (101 of 512 codes active), demonstrating efficient discrete representation learning. Contrastive fine-tuning with different projection dimensions yields substantial clustering improvements: 64-dimensional embeddings achieve +35% Silhouette score improvement (0.31 to 0.42), while 128-dimensional embeddings achieve +42% improvement (0.31 to 0.44), clearly demonstrating the impact of embedding dimensionality on variant discrimination capability. Our reference-free framework provides a scalable, interpretable approach to genomic surveillance with direct applications to public health monitoring.
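The abstract names three mechanisms that a small sketch can make concrete: k-mer tokenization of reads, nearest-codebook quantization, and the codebook-utilization and Silhouette-improvement arithmetic. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the k-mer size, the stride, or the distance metric, so `k=3`, stride 1, and squared Euclidean distance are assumptions, and all function names are hypothetical.

```python
from itertools import product


def kmer_tokenize(seq, k=3):
    """Map a nucleotide read to overlapping k-mer token ids (stride 1).

    Assumption: k=3 and stride 1 are illustrative; the abstract does not
    state the values used in the paper.
    """
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    unk = len(vocab)  # one shared id for k-mers with ambiguous bases such as N
    seq = seq.upper()
    return [vocab.get(seq[i:i + k], unk) for i in range(len(seq) - k + 1)]


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry.

    Assumption: squared Euclidean distance, the usual VQ-VAE choice.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda j: sqdist(v, codebook[j]))
        for v in vectors
    ]


def codebook_utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


def relative_improvement(before, after):
    """Relative change of a score, e.g. a Silhouette score."""
    return (after - before) / before


if __name__ == "__main__":
    print(kmer_tokenize("ACGTN"))                        # [6, 27, 64]
    toy_codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # toy values, not learned
    codes = quantize([[0.1, 0.0], [0.9, 1.0]], toy_codebook)
    print(codes, codebook_utilization(codes, len(toy_codebook)))
    # Figures quoted in the abstract:
    print(round(101 / 512 * 100, 2))                     # 19.73
    print(round(relative_improvement(0.31, 0.42) * 100)) # 35
    print(round(relative_improvement(0.31, 0.44) * 100)) # 42
```

The last three lines reproduce the abstract's reported numbers: 101 of 512 active codes is 19.73% utilization, and Silhouette scores rising from 0.31 to 0.42 and 0.44 correspond to relative gains of about 35% and 42%.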
Related papers
- C-GRASP: Clinically-Grounded Reasoning for Affective Signal Processing [0.0]
Heart rate variability (HRV) is a pivotal noninvasive marker for autonomic monitoring. Applying Large Language Models (LLMs) to HRV interpretation is hindered by physiological hallucinations. We propose C-GRASP, a guardrailed RAG-enhanced pipeline that decomposes HRV interpretation into eight traceable reasoning steps.
arXiv Detail & Related papers (2026-01-15T12:35:35Z) - AGNES: Adaptive Graph Neural Network and Dynamic Programming Hybrid Framework for Real-Time Nanopore Seed Chaining [0.0]
Nanopore sequencing enables real-time long-read DNA sequencing with reads exceeding 10 kilobases. Inherent error rates of 12-15 percent present significant computational challenges for read alignment. This paper presents RawHash3, a hybrid framework combining graph neural networks with classical dynamic programming for adaptive seed chaining.
arXiv Detail & Related papers (2025-10-15T08:05:43Z) - Hemorica: A Comprehensive CT Scan Dataset for Automated Brain Hemorrhage Classification, Segmentation, and Detection [0.749500254646884]
Hemorica is a publicly available collection of 372 head CT examinations acquired between 2012 and 2024. Each scan has been exhaustively annotated for five ICH subtypes: epidural (EPH), subdural (SDH), subarachnoid (SAH), intraparenchymal (IPH). Hemorica offers a unified, fine-grained benchmark that supports multi-task and curriculum learning.
arXiv Detail & Related papers (2025-09-26T23:09:41Z) - Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy [0.9999629695552196]
The present work develops and validates a data-driven and interpretable machine-learning framework designed to predict strokes. Ten routinely gathered demographic, lifestyle, and clinical variables were sourced from a public cohort of 4,981 records. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model.
arXiv Detail & Related papers (2025-05-18T21:46:45Z) - A Feature-Level Ensemble Model for COVID-19 Identification in CXR Images using Choquet Integral and Differential Evolution Optimization [0.7510165488300369]
An effective strategy to mitigate the COVID-19 pandemic involves integrating testing to identify infected individuals. While RT-PCR is considered the gold standard for diagnosing COVID-19, it has some limitations such as the risk of false negatives. This paper introduces a novel Deep Learning Diagnosis System that integrates pre-trained Deep Convolutional Neural Networks (DCNNs) within an ensemble learning framework.
arXiv Detail & Related papers (2025-01-14T16:28:02Z) - Large-Scale Targeted Cause Discovery via Learning from Simulated Data [66.51307552703685]
We propose a novel machine learning approach for inferring causal variables of a target variable from observations. We train a neural network using supervised learning on simulated data to infer causality. Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks.
arXiv Detail & Related papers (2024-08-29T02:21:11Z) - A Conditional Flow Variational Autoencoder for Controllable Synthesis of Virtual Populations of Anatomy [76.20367415712867]
We propose a conditional variational autoencoder (cVAE) with normalising flows to boost the flexibility and complexity of the approximate posterior learnt.
We demonstrate the performance of our conditional flow VAE using a data set of cardiac left ventricles acquired from 2360 patients.
arXiv Detail & Related papers (2023-06-26T13:23:52Z) - G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-guided Feature Imitation [49.421099172544196]
We propose a novel semantic-guided feature imitation technique, which automatically performs soft matching between feature pairs across all pyramid levels.
We also introduce contrastive distillation to effectively capture the information encoded in the relationship between different feature regions.
Our method consistently outperforms the existing detection KD techniques, and works when (1) components in the framework are used separately and in conjunction.
arXiv Detail & Related papers (2021-08-17T07:44:27Z) - Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilation convolution is a critical mutant of standard convolution neural network to control effective receptive fields and handle large scale variance of objects.
We propose a new mutant of dilated convolution, namely inception (dilated) convolution where the convolutions have independent dilation among different axes, channels and layers.
We explore a practical method for fitting the complex inception convolution to the data, a simple while effective dilation search algorithm(EDO) based on statistical optimization is developed.
arXiv Detail & Related papers (2020-12-25T14:58:35Z) - CovidDeep: SARS-CoV-2/COVID-19 Test Based on Wearable Medical Sensors and Efficient Neural Networks [51.589769497681175]
The novel coronavirus (SARS-CoV-2) has led to a pandemic.
The current testing regime based on Reverse Transcription-Polymerase Chain Reaction for SARS-CoV-2 has been unable to keep up with testing demands.
We propose a framework called CovidDeep that combines efficient DNNs with commercially available WMSs for pervasive testing of the virus.
arXiv Detail & Related papers (2020-07-20T21:47:28Z) - A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.