Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics
 - URL: http://arxiv.org/abs/2502.07749v1
 - Date: Tue, 11 Feb 2025 18:25:14 GMT
 - Title: Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics
 - Authors: Tamsin James, Ben Williamson, Peter Tino, Nicole Wheeler
 - Abstract summary: We set up problems surrounding phenotype prediction from bacterial whole-genome datasets and extend those to learning causal effects. We discuss challenges that impact the reliability of a machine's decision-making when faced with datasets of this nature.
 - Score: 0.8437187555622164
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   How can we identify causal genetic mechanisms that govern bacterial traits? Initial efforts entrusting machine learning models to handle the task of predicting phenotype from genotype return high accuracy scores. However, attempts to extract any meaning from the predictive models are found to be corrupted by falsely identified "causal" features. Relying solely on pattern recognition and correlations is unreliable, significantly so in bacterial genomics settings where high-dimensionality and spurious associations are the norm. Though it is not yet clear whether we can overcome this hurdle, significant efforts are being made towards discovering potential high-risk bacterial genetic variants. In view of this, we set up open problems surrounding phenotype prediction from bacterial whole-genome datasets and extending those to learning causal effects, and discuss challenges that impact the reliability of a machine's decision-making when faced with datasets of this nature. 
 
       
      
        Related papers
        - Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way. We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv (2025-06-26T09:05:38Z)
        - Inferring genotype-phenotype maps using attention models [0.21990652930491852]
Predicting phenotype from genotype is a central challenge in genetics.
Recent advances in machine learning, particularly attention-based models, offer a promising alternative.
Here, we apply attention-based models to quantitative genetics.
arXiv (2025-04-14T16:32:17Z)
        - G2PDiffusion: Genotype-to-Phenotype Prediction with Diffusion Models [108.94237816552024]
This paper introduces G2PDiffusion, the first-of-its-kind diffusion model designed for genotype-to-phenotype generation across multiple species. We use images to represent morphological phenotypes across species and redefine phenotype prediction as conditional image generation.
arXiv (2025-02-07T06:16:31Z)
        - BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv (2024-05-27T19:57:17Z)
        - Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv (2024-04-26T16:39:50Z)
        - Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv (2024-03-02T00:56:05Z)
        - Predicting loss-of-function impact of genetic mutations: a machine learning approach [0.0]
This paper aims to train machine learning models on the attributes of a genetic mutation to predict LoFtool scores.
These attributes included, but were not limited to, the position of a mutation on a chromosome, changes in amino acids, and changes in codons caused by the mutation.
Models were evaluated using five-fold cross-validated averages of r-squared, mean squared error, root mean squared error, mean absolute error, and explained variance.
arXiv (2024-01-26T19:27:38Z)
        - Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv (2023-10-23T13:35:24Z)
        - Genetic prediction of quantitative traits: a machine learner's guide focused on height [0.0]
We provide an overview for the machine learning community of current state-of-the-art models and associated subtleties.
We use height as an example of a continuous-valued phenotype and provide an introduction to benchmark datasets, confounders, feature selection, and common metrics.
arXiv (2023-10-06T05:43:50Z)
        - Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data [0.2812395851874055]
We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes.
We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models.
arXiv (2023-06-19T20:52:37Z)
        - CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data [61.088705993848606]
We introduce CausalBench, a benchmark suite for evaluating causal inference methods on real-world interventional data.
CausalBench incorporates biologically-motivated performance metrics, including new distribution-based interventional metrics.
arXiv (2022-10-31T13:04:07Z)
        - Mycorrhiza: Genotype Assignment using Phylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem.
Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples.
Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv (2020-10-14T02:36:27Z)
        - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv (2020-09-02T02:50:30Z)
        - Handling highly correlated genes in prediction analysis of genomic studies [0.0]
High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models.
We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection.
Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable under condition change.
arXiv (2020-07-05T22:14:03Z)