Improving Diseases Predictions Utilizing External Bio-Banks
- URL: http://arxiv.org/abs/2504.00036v1
- Date: Sun, 30 Mar 2025 13:05:20 GMT
- Title: Improving Diseases Predictions Utilizing External Bio-Banks
- Authors: Hido Pinto, Eran Segal
- Abstract summary: We demonstrate how machine learning can be leveraged to enhance explainability and uncover biologically meaningful associations. We train LightGBM models from scratch on our dataset (10K) to impute metabolomics features. The imputed metabolomics features are then used in survival analysis to assess their impact on disease-related risk factors.
- Score: 1.9336815376402723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning has been successfully used in critical domains, such as medicine. However, extracting meaningful insights from biomedical data is often constrained by the scarcity of disease labels. In this research, we demonstrate how machine learning can be leveraged to enhance explainability and uncover biologically meaningful associations, even when predictive improvements in disease modeling are limited. We train LightGBM models from scratch on our dataset (10K) to impute metabolomics features and apply them to the UK Biobank (UKBB) for downstream analysis. The imputed metabolomics features are then used in survival analysis to assess their impact on disease-related risk factors. As a result, our approach successfully identified biologically relevant connections that were not previously known to the predictive models. Additionally, we ran a genome-wide association study (GWAS) on key metabolomics features, revealing a link between vascular dementia and smoking. Although this is a well-established epidemiological relationship, it was not embedded in the models' training data, which validates the method's ability to extract meaningful signals. Furthermore, by integrating survival models as inputs in the 10K data, we uncovered associations between metabolic substances and obesity, demonstrating the ability to infer disease risk for future patients without requiring direct outcome labels. These findings highlight the potential of leveraging external bio-banks to extract valuable biomedical insights, even in data-limited scenarios. Our results demonstrate that machine learning models trained on smaller datasets can still be used to uncover real biological associations when carefully integrated with survival analysis and genetic studies.
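The workflow described in the abstract (fit an imputation model on a small cohort, apply it to a larger external cohort, then compare survival by imputed value) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the data are synthetic, the variable names are hypothetical, a one-predictor least-squares line stands in for the LightGBM models, and a Kaplan-Meier estimator stands in for the survival analysis, whose exact form the abstract does not specify.

```python
import math
import random
from statistics import median


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor.

    Stand-in for the LightGBM imputation model named in the abstract.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event."""
    at_risk = len(times)
    surv, curve = 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve


rng = random.Random(0)

# Small cohort (synthetic): the metabolite is measured directly.
small_bmi = [rng.uniform(18, 40) for _ in range(200)]
small_met = [0.1 * b + rng.gauss(0, 0.5) for b in small_bmi]
slope, intercept = fit_line(small_bmi, small_met)

# Large cohort (synthetic): no metabolite, but follow-up data exist.
large_bmi = [rng.uniform(18, 40) for _ in range(1000)]
imputed = [slope * b + intercept for b in large_bmi]
hazard = [0.01 * math.exp(0.15 * (b - 29)) for b in large_bmi]
raw = [rng.expovariate(h) for h in hazard]
times = [min(t, 10.0) for t in raw]   # administrative censoring at 10 years
events = [t < 10.0 for t in raw]      # True when the event was observed

# Compare survival between low and high imputed-metabolite groups.
cut = median(imputed)
final_surv = {}
for label, keep in (("low", lambda v: v <= cut), ("high", lambda v: v > cut)):
    idx = [i for i, v in enumerate(imputed) if keep(v)]
    curve = kaplan_meier([times[i] for i in idx], [events[i] for i in idx])
    final_surv[label] = curve[-1][1] if curve else 1.0
    print(f"{label} imputed metabolite: S(10y) ~ {final_surv[label]:.3f}")
```

The point of the sketch is the data flow: the outcome labels are never seen by the imputation model, so any survival difference between the two groups arises only through the imputed feature.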
Related papers
- Identifying Critical Phases for Disease Onset with Sparse Haematological Biomarkers [0.0]
Clinical blood tests are an emerging molecular data source for large-scale biomedical research. Traditional imputation approaches distort learning signals and bias predictions while lacking biological interpretability. We propose a novel methodology using Graph Neural Additive Networks (GNAN) to model delta biomarker trajectories.
arXiv Detail & Related papers (2025-03-18T07:29:45Z)
- Causal Representation Learning from Multimodal Biomedical Observations [57.00712157758845]
We develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. Key theoretical contribution is the structural sparsity of causal connections between modalities. Results on a real-world human phenotype dataset are consistent with established biomedical research.
arXiv Detail & Related papers (2024-11-10T16:40:27Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Embracing assay heterogeneity with neural processes for markedly improved bioactivity predictions [0.276240219662896]
Predicting the bioactivity of a ligand is one of the hardest and most important challenges in computer-aided drug discovery.
Despite years of data collection and curation efforts, bioactivity data remains sparse and heterogeneous.
We present a hierarchical meta-learning framework that exploits the information synergy across disparate assays.
arXiv Detail & Related papers (2023-08-17T16:26:58Z)
- Drug Synergistic Combinations Predictions via Large-Scale Pre-Training and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z)
- Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data [0.8029049649310213]
We propose a framework called Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data (fiBAG).
fiBAG allows simultaneous identification of upstream functional evidence of proteogenomic biomarkers.
We demonstrate the profitability of fiBAG via a pan-cancer analysis of 14 cancer types.
arXiv Detail & Related papers (2022-12-29T03:31:45Z)
- Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs [6.708052194104378]
We extend a popular approach for probabilistic non-linear dimensionality reduction, the Gaussian process latent variable model, to scale to massive single-cell datasets.
The key idea is to use an augmented kernel which preserves the factorisability of the lower bound allowing for fast variational inference.
arXiv Detail & Related papers (2022-09-14T15:25:15Z)
- Temporal Positive-unlabeled Learning for Biomedical Hypothesis Generation via Risk Estimation [46.852387038668695]
This paper aims to introduce the use of machine learning to the scientific process of hypothesis generation.
We propose a variational inference model to estimate the positive prior, and incorporate it in the learning of node pair embeddings.
Experiment results on real-world biomedical term relationship datasets and case study analyses on a COVID-19 dataset validate the effectiveness of the proposed model.
arXiv Detail & Related papers (2020-10-05T10:58:03Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta-learning techniques to develop a new model that can extract common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, a simple yet effective meta-learning method for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.