StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis
- URL: http://arxiv.org/abs/2505.22746v1
- Date: Wed, 28 May 2025 18:05:15 GMT
- Title: StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis
- Authors: Jose Guadalupe Hernandez, Attri Ghosh, Philip J. Freda, Yufei Meng, Nicholas Matsumoto, Jason H. Moore,
- Abstract summary: We present the Star-Based Automated Single-locus and Epistasis analysis tool - Genetic Programming (StarBASE-GP)<n>StarBASE-GP is an automated framework for discovering meaningful genetic variants associated with phenotypic variation in large-scale genomic datasets.<n>We evaluate StarBASE-GP on a cohort of Rattus norvegicus (brown rat) to identify variants associated with body mass index.
- Score: 1.6393663206537612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the Star-Based Automated Single-locus and Epistasis analysis tool - Genetic Programming (StarBASE-GP), an automated framework for discovering meaningful genetic variants associated with phenotypic variation in large-scale genomic datasets. StarBASE-GP uses a genetic programming-based multi-objective optimization strategy to evolve machine learning pipelines that simultaneously maximize explanatory power (r2) and minimize pipeline complexity. Biological domain knowledge is integrated at multiple stages, including the use of nine inheritance encoding strategies to model deviations from additivity, a custom linkage disequilibrium pruning node that minimizes redundancy among features, and a dynamic variant recommendation system that prioritizes informative candidates for pipeline inclusion. We evaluate StarBASE-GP on a cohort of Rattus norvegicus (brown rat) to identify variants associated with body mass index, benchmarking its performance against a random baseline and a biologically naive version of the tool. StarBASE-GP consistently evolves Pareto fronts with superior performance, yielding higher accuracy in identifying both ground truth and novel quantitative trait loci, highlighting relevant targets for future validation. By incorporating evolutionary search and relevant biological theory into a flexible automated machine learning framework, StarBASE-GP demonstrates robust potential for advancing variant discovery in complex traits.
Related papers
- PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs [80.08310253195144]
PRING is the first benchmark that evaluates protein-protein interaction prediction from a graph-level perspective.<n> PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions.
arXiv Detail & Related papers (2025-07-07T15:21:05Z) - OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking [21.177773831820673]
Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome.<n>As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation.<n>We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs.
arXiv Detail & Related papers (2025-05-20T14:16:25Z) - Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity [0.39945675027960637]
We introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling.<n>GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines.<n>We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness.
arXiv Detail & Related papers (2025-04-22T20:34:47Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Network-based Neighborhood regression [0.0]
Current statistical analysis on biological modules focuses on either detecting the functional modules in biological networks or sub-group regression on the biological features without using the network data.<n>This paper proposes a novel network-based neighborhood regression framework whose regression functions depend on both the global community-level information and local connectivity structures among entities.
arXiv Detail & Related papers (2024-07-04T18:08:40Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - Weighted Ensembles for Active Learning with Adaptivity [60.84896785303314]
This paper presents an ensemble of GP models with weights adapted to the labeled data collected incrementally.
Building on this novel EGP model, a suite of acquisition functions emerges based on the uncertainty and disagreement rules.
An adaptively weighted ensemble of EGP-based acquisition functions is also introduced to further robustify performance.
arXiv Detail & Related papers (2022-06-10T11:48:49Z) - Incremental Ensemble Gaussian Processes [53.3291389385672]
We propose an incremental ensemble (IE-) GP framework, where an EGP meta-learner employs an it ensemble of GP learners, each having a unique kernel belonging to a prescribed kernel dictionary.
With each GP expert leveraging the random feature-based approximation to perform online prediction and model update with it scalability, the EGP meta-learner capitalizes on data-adaptive weights to synthesize the per-expert predictions.
The novel IE-GP is generalized to accommodate time-varying functions by modeling structured dynamics at the EGP meta-learner and within each GP learner.
arXiv Detail & Related papers (2021-10-13T15:11:25Z) - Complexity-based speciation and genotype representation for
neuroevolution [81.21462458089142]
This paper introduces a speciation principle for neuroevolution where evolving networks are grouped into species based on the number of hidden neurons.
The proposed speciation principle is employed in several techniques designed to promote and preserve diversity within species and in the ecosystem as a whole.
arXiv Detail & Related papers (2020-10-11T06:26:56Z) - EPGAT: Gene Essentiality Prediction With Graph Attention Networks [1.1602089225841632]
We propose EPGAT, an approach for essentiality prediction based on Graph Attention Networks (GATs)
Our model directly learns patterns of gene essentiality from PPI networks, integrating additional evidence from multiomics data encoded as node attributes.
We benchmarked EPGAT for four organisms, including humans, accurately predicting gene essentiality with AUC score ranging from 0.78 to 0.97.
arXiv Detail & Related papers (2020-07-19T13:47:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.