Related papers: Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data

URL: http://arxiv.org/abs/2306.11157v2
Date: Sat, 17 Feb 2024 03:03:59 GMT
Title: Human Limits in Machine Learning: Prediction of Plant Phenotypes Using Soil Microbiome Data
Authors: Rosa Aghdam, Xudong Tang, Shan Shan, Richard Lankau, Claudia Solís-Lemus
Abstract summary: We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models.
Score: 0.2812395851874055
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The preservation of soil health is a critical challenge in the 21st century due to its significant impact on agriculture, human health, and biodiversity. We provide the first deep investigation of the predictive potential of machine learning models to understand the connections between soil and biological phenotypes. We investigate an integrative framework performing accurate machine learning-based prediction of plant phenotypes from biological, chemical, and physical properties of the soil via two models: random forest and Bayesian neural network. We show that prediction is improved when incorporating environmental features like soil physicochemical properties and microbial population density into the models, in addition to the microbiome information. Exploring various data preprocessing strategies confirms the significant impact of human decisions on predictive performance. We show that the naive total sum scaling normalization that is commonly used in microbiome research is not the optimal strategy to maximize predictive power. Also, we find that accurately defined labels are more important than normalization, taxonomic level or model characteristics. In cases where humans are unable to classify samples accurately, machine learning model performance is limited. Lastly, we provide domain scientists with a full model selection decision tree to identify the human choices that optimize model prediction power. Our work is accompanied by open source reproducible scripts (https://github.com/solislemuslab/soil-microbiome-nn) for maximum outreach among the microbiome research community.

Related papers

Calibrating Biophysical Models for Grape Phenology Prediction via Multi-Task Learning [5.796482272333648]
We propose a hybrid modeling approach that combines multi-task learning with a recurrent neural network to parameterize a differentiable biophysical model. By using multi-task learning to predict the parameters of the biophysical model, our approach enables shared learning across cultivars while preserving biological structure.
arXiv Detail & Related papers (2025-08-05T20:36:11Z)
Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics [0.8437187555622164]
We set up problems surrounding phenotype prediction from bacterial whole-genome datasets and extend those to learning causal effects. We discuss challenges that impact the reliability of a machine's decision-making when faced with datasets of this nature.
arXiv Detail & Related papers (2025-02-11T18:25:14Z)
Causal Representation Learning from Multimodal Biomedical Observations [57.00712157758845]
We develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. The key theoretical contribution is the structural sparsity of causal connections between modalities. Results on a real-world human phenotype dataset are consistent with established biomedical research.
arXiv Detail & Related papers (2024-11-10T16:40:27Z)
Stacked ensemble-based mutagenicity prediction model using multiple modalities with graph attention network [0.9736758288065405]
Mutagenicity is a concern due to its association with genetic mutations which can result in a variety of negative consequences. In this work, we introduce a novel stacked ensemble based mutagenicity prediction model.
arXiv Detail & Related papers (2024-09-03T09:14:21Z)
Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold [83.18058549195855]
We argue that multiple processes in natural sciences have to be represented as vector fields on the Wasserstein manifold of probability densities. In particular, this is crucial for personalized medicine where the development of diseases and their respective treatment response depends on the microenvironment of cells specific to each patient. We propose Meta Flow Matching (MFM), a practical approach to integrating along these vector fields on the Wasserstein manifold by amortizing the flow model over the initial populations.
arXiv Detail & Related papers (2024-08-26T20:05:31Z)
BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions. BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model. It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z)
Smoke and Mirrors in Causal Downstream Tasks [59.90654397037007]
This paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations. We compare 6,480 models fine-tuned from state-of-the-art visual backbones, and find that the sampling and modeling choices significantly affect the accuracy of the causal estimate. Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones.
arXiv Detail & Related papers (2024-05-27T13:26:34Z)
Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity [3.972930262155919]
We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats.
arXiv Detail & Related papers (2024-05-09T09:34:51Z)
Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues. We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space. A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes [72.13373216644021]
We study the societal impact of machine learning by considering the collection of models that are deployed in a given context. We find deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available. These examples demonstrate ecosystem-level analysis has unique strengths for characterizing the societal impact of machine learning.
arXiv Detail & Related papers (2023-07-12T01:11:52Z)
Application of data engineering approaches to address challenges in microbiome data for optimal medical decision-making [0.0]
The prototype employed in the study addresses the issues inherent to microbiome datasets and could be highly beneficial for providing personalized medicine.
arXiv Detail & Related papers (2023-06-30T05:36:39Z)
Adaptive Transfer Learning for Plant Phenotyping [33.28898554551106]
We study the knowledge transferability of modern machine learning models in plant phenotyping. How is the performance of conventional machine learning models affected by the number of annotated samples for plant phenotyping? Could the neural network based transfer learning models improve the performance of plant phenotyping?
arXiv Detail & Related papers (2022-01-14T00:40:40Z)
Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling. We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z)
Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data. Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.