Related papers: Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling

Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling

URL: http://arxiv.org/abs/2504.07065v1
Date: Wed, 09 Apr 2025 17:30:43 GMT
Title: Enhancing Downstream Analysis in Genome Sequencing: Species Classification While Basecalling
Authors: Riselda Kodra, Hadjer Benmeziane, Irem Boybat, William Andrew Simon,
Abstract summary: This paper introduces a novel method to profile signals coming from sequencing devices in parallel with determining their nucleotide sequences.<n>We introduce a new loss strategy where losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers.<n>We achieve state-of-the-art basecalling accuracies, while classification accuracies meet and exceed the results of state-of-the-art binary classifiers.
Score: 1.0017486177151396
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The ability to quickly and accurately identify microbial species in a sample, known as metagenomic profiling, is critical across various fields, from healthcare to environmental science. This paper introduces a novel method to profile signals coming from sequencing devices in parallel with determining their nucleotide sequences, a process known as basecalling, via a multi-objective deep neural network for simultaneous basecalling and multi-class genome classification. We introduce a new loss strategy where losses for basecalling and classification are back-propagated separately, with model weights combined for the shared layers, and a pre-configured ranking strategy allowing top-K species accuracy, giving users flexibility to choose between higher accuracy or higher speed at identifying the species. We achieve state-of-the-art basecalling accuracies, while classification accuracies meet and exceed the results of state-of-the-art binary classifiers, attaining an average of 92.5%/98.9% accuracy at identifying the top-1/3 species among a total of 17 genomes in the Wick bacterial dataset. The work presented here has implications for future studies in metagenomic profiling by accelerating the bottleneck step of matching the DNA sequence to the correct genome.

Related papers

A Large-Scale Benchmark of Cross-Modal Learning for Histology and Gene Expression in Spatial Transcriptomics [2.3070195554676993]
HESCAPE is a benchmark for evaluating cross-modal contrastive pretraining in spatial transcriptomics.<n>Gene models pretrained on spatial transcriptomics data outperform both those trained without spatial data and simple baseline approaches.<n>We identify batch effects as a key factor that interferes with effective cross-modal alignment.
arXiv Detail & Related papers (2025-08-02T21:11:36Z)
Learning Genomic Structure from $k$-mers [2.07180164747172]
We present a method for analyzing read data using contrastive learning.<n>An encoder model is trained to produce embeddings that cluster together sequences from the same genomic region.<n>The model can also be trained fully self-supervised on read data, enabling analysis without the need to construct a full genome assembly.
arXiv Detail & Related papers (2025-05-22T13:46:18Z)
A Misclassification Network-Based Method for Comparative Genomic Analysis [3.7671415694914927]
Classifying genome sequences based on metadata has been an active area of research in comparative genomics for decades.<n>In this study, we integrate AI and network science approaches to develop a comparative genomic analysis framework.
arXiv Detail & Related papers (2024-12-09T23:22:15Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes. Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues. textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
DNA Sequence Classification with Compressors [0.0]
Our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis. Not only does this method align with the current state-of-the-art in terms of accuracy, but it also offers a more resource-efficient alternative to traditional machine learning methods.
arXiv Detail & Related papers (2024-01-25T09:17:19Z)
Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells. During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation. This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude less parameters and pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
Using Signal Processing in Tandem With Adapted Mixture Models for Classifying Genomic Signals [16.119729980200955]
We propose a novel technique that employs signal processing in tandem with Gaussian mixture models to improve the spectral representation of a sequence. Our method outperforms a similar state-of-the-art method on established benchmark datasets by an absolute margin of 6.06% accuracy.
arXiv Detail & Related papers (2022-11-03T06:10:55Z)
CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data [61.088705993848606]
We introduce CausalBench, a benchmark suite for evaluating causal inference methods on real-world interventional data. CaulBench incorporates biologically-motivated performance metrics, including new distribution-based interventional metrics.
arXiv Detail & Related papers (2022-10-31T13:04:07Z)
Assigning Species Information to Corresponding Genes by a Sequence Labeling Framework [7.231921004060877]
Existing methods typically rely on rules based on gene and species co-occurrence in the article. We develop a high-performance method, using a novel deep learning-based framework, to classify whether there is a relation between a gene and a species. Our benchmarking results show that our approach obtains significantly higher performance compared to that of the rule-based baseline method.
arXiv Detail & Related papers (2022-05-08T12:39:45Z)
Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT. We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
Cancer Gene Profiling through Unsupervised Discovery [49.28556294619424]
We introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers. Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm. Our signature reports promising results on distinguishing immune inflammatory and immune desert tumors.
arXiv Detail & Related papers (2021-02-11T09:04:45Z)
Mycorrhiza: Genotype Assignment usingPhylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem. Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples. Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z)
Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients. We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks. Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.