Assessing putative bias in prediction of anti-microbial resistance from
real-world genotyping data under explicit causal assumptions
- URL: http://arxiv.org/abs/2107.03383v1
- Date: Tue, 6 Jul 2021 21:19:21 GMT
- Authors: Mattia Prosperi, Simone Marini, Christina Boucher, Jiang Bian
- Abstract summary: Development of AMR prediction tools can be biased, since sampling is non-randomized.
We evaluate the effectiveness of propensity-based rebalancing and confounding adjustment on AMR prediction using genotype-phenotype AMR data.
- Score: 3.795323061432507
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Whole genome sequencing (WGS) is quickly becoming the customary means for
identification of antimicrobial resistance (AMR) due to its ability to obtain
high resolution information about the genes and mechanisms that are causing
resistance and driving pathogen mobility. By contrast, traditional phenotypic
(antibiogram) testing cannot easily elucidate such information. Yet development
of AMR prediction tools from genotype-phenotype data can be biased, since
sampling is non-randomized. Sample provenience, period of collection, and
species representation can confound the association of genetic traits with AMR.
Thus, prediction models can perform poorly on new data with sampling
distribution shifts. In this work -- under an explicit set of causal
assumptions -- we evaluate the effectiveness of propensity-based rebalancing
and confounding adjustment on AMR prediction using genotype-phenotype AMR data
from the Pathosystems Resource Integration Center (PATRIC). We select bacterial
genotypes (encoded as k-mer signatures, i.e. DNA fragments of length k),
country, year, species, and AMR phenotypes for the tetracycline drug class,
preparing test data with recent genomes coming from a single country. We test
boosted logistic regression (BLR) and random forests (RF) with/without
bias-handling. On 10,936 instances, we find evidence of species, location and
year imbalance with respect to the AMR phenotype. The crude versus
bias-adjusted change in effect of genetic signatures on AMR varies but only
moderately (selecting the top 20,000 out of 40+ million k-mers). The area under
the receiver operating characteristic (AUROC) of the RF (0.95) is comparable to
that of BLR (0.94) on both out-of-bag samples from bootstrap and the external
test (n=1,085), where AUROCs do not decrease. We observe a 1%-5% gain in AUROC
with bias-handling compared to the sole use of genetic signatures. ...
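The two technical ingredients named in the abstract, k-mer signatures and propensity-based rebalancing, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the weighting shown (inverse of the observed phenotype's frequency within a confounder stratum such as species, country, or year) is one common form of rebalancing, assumed here because the abstract does not specify the exact procedure.

```python
from collections import Counter, defaultdict


def kmer_counts(sequence, k):
    """Count every overlapping DNA fragment of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def inverse_propensity_weights(strata, labels):
    """Weight each sample by 1 / P(observed label | stratum).

    strata: one hashable confounder value per sample,
            e.g. a (species, country, year) tuple (hypothetical encoding).
    labels: 0/1 AMR phenotype per sample (1 = resistant).
    """
    totals = Counter(strata)
    positives = defaultdict(int)
    for stratum, label in zip(strata, labels):
        positives[stratum] += label

    weights = []
    for stratum, label in zip(strata, labels):
        p_resistant = positives[stratum] / totals[stratum]
        # Probability of the label actually observed for this sample.
        # It is never zero, because the sample itself contributes to it.
        p_observed = p_resistant if label == 1 else 1.0 - p_resistant
        weights.append(1.0 / p_observed)
    return weights


# Toy data: stratum "A" is 75% resistant, stratum "B" is 50% resistant.
strata = ["A", "A", "A", "A", "B", "B"]
labels = [1, 1, 1, 0, 0, 1]
weights = inverse_propensity_weights(strata, labels)
signature = kmer_counts("ACGTACG", 3)
```

Within each stratum the weights of resistant and susceptible samples sum to the same total, so a downstream classifier that accepts sample weights no longer sees the phenotype imbalance tied to that stratum.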
Related papers
- Optimizing Gene-Based Testing for Antibiotic Resistance Prediction [11.1340971514584]
Antibiotic Resistance (AR) is a critical global health challenge that necessitates the development of cost-effective, efficient, and accurate diagnostic tools.
Given the genetic basis of AR, techniques such as Polymerase Chain Reaction (PCR) that target specific resistance genes offer a promising approach for predictive diagnostics.
This study introduces GenoARM, a novel framework that integrates reinforcement learning (RL) with transformer-based models to optimize the selection of PCR gene tests and improve AR predictions.
arXiv Detail & Related papers (2025-02-19T14:34:03Z)
- Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics [0.8437187555622164]
We set up problems surrounding phenotype prediction from bacterial whole-genome datasets and extend those to learning causal effects.
We discuss challenges that impact the reliability of a machine's decision-making when faced with datasets of this nature.
arXiv Detail & Related papers (2025-02-11T18:25:14Z)
- Survey and Improvement Strategies for Gene Prioritization with Large Language Models [61.24568051916653]
Large language models (LLMs) have performed well in medical exams, but their effectiveness in diagnosing rare genetic diseases has not been assessed.
We used multi-agent and Human Phenotype Ontology (HPO) classification to categorize patients based on phenotypes and solvability levels.
At baseline, GPT-4 outperformed other LLMs, achieving near 30% accuracy in ranking causal genes correctly.
arXiv Detail & Related papers (2025-01-30T23:03:03Z)
- CSGDN: Contrastive Signed Graph Diffusion Network for Predicting Crop Gene-phenotype Associations [6.5678927417916455]
We propose a Contrastive Signed Graph Diffusion Network, CSGDN, to learn robust node representations with fewer training samples to achieve higher link prediction accuracy.
We conduct experiments to validate the performance of CSGDN on three crop datasets: Gossypium hirsutum, Brassica napus, and Triticum turgidum.
arXiv Detail & Related papers (2024-10-10T01:01:10Z)
- Regressor-free Molecule Generation to Support Drug Response Prediction [83.25894107956735]
Conditional generation based on the target IC50 score can obtain a more effective sampling space.
Regressor-free guidance combines a diffusion model's score estimation with a regression controller model's gradient based on number labels.
arXiv Detail & Related papers (2024-05-23T13:22:17Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- Predicting loss-of-function impact of genetic mutations: a machine learning approach [0.0]
This paper aims to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores.
These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation.
Models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance.
arXiv Detail & Related papers (2024-01-26T19:27:38Z)
- Isoform Function Prediction Using a Deep Neural Network [9.507435239304591]
Studies have shown that more than 95% of human multi-exon genes have undergone alternative splicing.
Alternative splicing plays a significant role in human health and disease.
This project uses all Conditional data and valuable information such as mRNA sequences, expression profiles, and gene graphs.
arXiv Detail & Related papers (2022-08-05T09:31:25Z)
- rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes [71.1144397510333]
We trained machine learning models to predict SNPs using 56 brain imaging QTs.
SNPs within the known Alzheimer disease (AD) risk gene APOE had lowest RMSE for lasso and random forest.
Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
arXiv Detail & Related papers (2022-03-31T20:15:22Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, outperforming various state-of-the-art baselines by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
- Mycorrhiza: Genotype Assignment using Phylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem.
Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples.
Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z)
- Tracking disease outbreaks from sparse data with Bayesian inference [55.82986443159948]
The COVID-19 pandemic provides new motivation for estimating the empirical rate of transmission during an outbreak.
Standard methods struggle to accommodate the partial observability and sparse data common at finer scales.
We propose a Bayesian framework which accommodates partial observability in a principled manner.
arXiv Detail & Related papers (2020-09-12T20:37:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.