Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN-XGBoost Ensemble
- URL: http://arxiv.org/abs/2509.23552v1
- Date: Sun, 28 Sep 2025 01:19:11 GMT
- Title: Fusing Sequence Motifs and Pan-Genomic Features: Antimicrobial Resistance Prediction using an Explainable Lightweight 1D CNN-XGBoost Ensemble
- Authors: Md. Saiful Bari Siddiqui, Nowshin Tarannum,
- Abstract summary: Antimicrobial Resistance (AMR) is a rapidly escalating global health crisis.<n>Standard machine learning models treat the genome as an unordered collection of features.<n>We propose AMR-EnsembleNet, an ensemble framework that combines sequence-based and feature-based learning.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Antimicrobial Resistance (AMR) is a rapidly escalating global health crisis. While genomic sequencing enables rapid prediction of resistance phenotypes, current computational methods have limitations. Standard machine learning models treat the genome as an unordered collection of features, ignoring the sequential context of Single Nucleotide Polymorphisms (SNPs). State-of-the-art sequence models like Transformers are often too data-hungry and computationally expensive for the moderately-sized datasets that are typical in this domain. To address these challenges, we propose AMR-EnsembleNet, an ensemble framework that synergistically combines sequence-based and feature-based learning. We developed a lightweight, custom 1D Convolutional Neural Network (CNN) to efficiently learn predictive sequence motifs from high-dimensional SNP data. This sequence-aware model was ensembled with an XGBoost model, a powerful gradient boosting system adept at capturing complex, non-local feature interactions. We trained and evaluated our framework on a benchmark dataset of 809 E. coli strains, predicting resistance across four antibiotics with varying class imbalance. Our 1D CNN-XGBoost ensemble consistently achieved top-tier performance across all the antibiotics, reaching a Matthews Correlation Coefficient (MCC) of 0.926 for Ciprofloxacin (CIP) and the highest Macro F1-score of 0.691 for the challenging Gentamicin (GEN) AMR prediction. We also show that our model consistently focuses on SNPs within well-known AMR genes like fusA and parC, confirming it learns the correct genetic signals for resistance. Our work demonstrates that fusing a sequence-aware 1D CNN with a feature-based XGBoost model creates a powerful ensemble, overcoming the limitations of using either an order-agnostic or a standalone sequence model.
Related papers
- dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning [16.347455063831067]
dnaHNet is a tokenizer-free autoregressive model that segments and models genomic sequences end-to-end.<n>Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency.<n>On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision.
arXiv Detail & Related papers (2026-02-11T07:48:21Z) - VAMP-Net: An Interpretable Multi-Path Framework of Genomic Permutation-Invariant Set Attention and Quality-Aware 1D-CNN for MTB Drug Resistance [0.0]
We present a novel Interpretable Variant-Aware Multi-Path Network (VAMP-Net)<n>Path-1 employs a Set Attention Transformer processing permutation-invariant variant sets to capture epistatic interactions between genomic loci.<n>Path-2 utilizes a 1D Confabal Network that analyzes Variant Call Format quality metrics to learn confidence scores.
arXiv Detail & Related papers (2025-12-25T21:28:54Z) - Bidirectional Representations Augmented Autoregressive Biological Sequence Generation:Application in De Novo Peptide Sequencing [51.12821379640881]
Autoregressive (AR) models offer holistic, bidirectional representations but face challenges with generative coherence and scalability.<n>We propose a hybrid framework enhancing AR generation by dynamically integrating rich contextual information from non-autoregressive mechanisms.<n>A novel cross-decoder attention module enables the AR decoder to iteratively query and integrate these bidirectional features.
arXiv Detail & Related papers (2025-10-09T12:52:55Z) - UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials [62.72989417755985]
We present UniGenX, a unified generative model for function in natural systems.<n>UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens.<n>It achieves state-of-the-art or competitive performance for the function-aware generation across domains.
arXiv Detail & Related papers (2025-03-09T16:43:07Z) - Predicting gene essentiality and drug response from perturbation screens in preclinical cancer models with LEAP: Layered Ensemble of Autoencoders and Predictors [0.0]
We introduce LEAP (Layered Ensemble of Autoencoders and Predictors), a novel ensemble framework to improve robustness and generalization.<n>LEAP consistently outperforms state-of-the-art approaches in predicting gene essentiality or drug responses in unseen cell lines, tissues and disease models.
arXiv Detail & Related papers (2025-02-21T18:12:36Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures [0.9674145073701153]
sc-OTGM is an unsupervised model grounded in the inductive bias that the scRNAseq data can be generated.
sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification.
It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
arXiv Detail & Related papers (2024-05-06T06:46:11Z) - GraSSRep: Graph-Based Self-Supervised Learning for Repeat Detection in
Metagenomic Assembly [24.55141372357102]
Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment.
GraSSRep is a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories.
GraSSRep combines sequencing features with pre-defined and learned graph features to achieve state-of-the-art performance in repeat detection.
arXiv Detail & Related papers (2024-02-14T18:26:58Z) - Gene-induced Multimodal Pre-training for Image-omic Classification [20.465959546613554]
This paper proposes a Gene-induced Multimodal Pre-training framework, which jointly incorporates genomics and Whole Slide Images (WSIs) for classification tasks.
Experimental results on the TCGA dataset show the superiority of our network architectures and our pre-training framework, achieving 99.47% in accuracy for image-omic classification.
arXiv Detail & Related papers (2023-09-06T04:30:15Z) - Continuous time recurrent neural networks: overview and application to
forecasting blood glucose in the intensive care unit [56.801856519460465]
Continuous time autoregressive recurrent neural networks (CTRNNs) are a deep learning model that account for irregular observations.
We demonstrate the application of these models to probabilistic forecasting of blood glucose in a critical care setting.
arXiv Detail & Related papers (2023-04-14T09:39:06Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of the systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - Reprogramming Pretrained Language Models for Antibody Sequence Infilling [72.13295049594585]
Computational design of antibodies involves generating novel and diverse sequences, while maintaining structural consistency.
Recent deep learning models have shown impressive results, however the limited number of known antibody sequence/structure pairs frequently leads to degraded performance.
In our work we address this challenge by leveraging Model Reprogramming (MR), which repurposes pretrained models on a source language to adapt to the tasks that are in a different language and have scarce data.
arXiv Detail & Related papers (2022-10-05T20:44:55Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - COVID-19 Classification Using Staked Ensembles: A Comprehensive Analysis [0.0]
COVID-19, increasing with a massive mortality rate, led to the WHO declaring it as a pandemic.
It is crucial to perform efficient and fast diagnosis.
The reverse transcript polymerase chain reaction (RTPCR) test is conducted to detect the presence of SARS-CoV-2.
Instead chest CT (or Chest X-ray) can be used for a fast and accurate diagnosis.
arXiv Detail & Related papers (2020-10-07T07:43:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.