OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking
- URL: http://arxiv.org/abs/2505.14402v1
- Date: Tue, 20 May 2025 14:16:25 GMT
- Title: OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking
- Authors: Heng Yang, Jack Cole, Yuan Li, Renzhi Chen, Geyong Min, Ke Li
- Abstract summary: Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome. As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation. We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs.
- Score: 21.177773831820673
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The code of nature, embedded in DNA and RNA genomes since the origin of life, holds immense potential to impact both humans and ecosystems through genome modeling. Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome. As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation. We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs. OmniGenBench enables standardized, one-command evaluation of any GFM across five benchmark suites, with seamless integration of over 31 open-source models. Through automated pipelines and community-extensible features, the platform addresses critical reproducibility challenges, including data transparency, model interoperability, benchmark fragmentation, and black-box interpretability. OmniGenBench aims to serve as foundational infrastructure for reproducible genomic AI research, accelerating trustworthy discovery and collaborative innovation in the era of genome-scale modeling.
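The abstract describes a layered design in which data, model, and benchmarking components are decoupled so that any GFM can be evaluated with one command. The sketch below illustrates that registry-style layering in miniature; all names (`SUITES`, `register_suite`, `evaluate`) are hypothetical and do not reflect the actual OmniGenBench API.

```python
# Hypothetical sketch of the modular layering described in the abstract:
# a shared registry maps benchmark-suite names to task lists, so any model
# exposing a common callable interface can be evaluated in one call.
# These names are illustrative only, not the real OmniGenBench API.
from typing import Callable, Dict, List

SUITES: Dict[str, List[str]] = {}  # suite name -> task names


def register_suite(name: str, tasks: List[str]) -> None:
    """Add a benchmark suite to the shared registry (the 'benchmark layer')."""
    SUITES[name] = tasks


def evaluate(model: Callable[[str], float], suite: str) -> Dict[str, float]:
    """Run a model (task -> score) on every task in a registered suite."""
    return {task: model(task) for task in SUITES[suite]}


# A stand-in model that returns a constant score for every task.
register_suite("rna-structure", ["ssp", "te-prediction"])
scores = evaluate(lambda task: 1.0, "rna-structure")
```

The point of the pattern is that adding a new suite or a new model touches only one layer; the evaluation loop itself never changes.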
Related papers
- Benchmarking and Evaluation of AI Models in Biology: Outcomes and Recommendations from the CZI Virtual Cells Workshop [18.00029758641004]
We aim to accelerate the development of robust benchmarks for AI-driven Virtual Cells. These benchmarks are crucial for ensuring rigor and biological relevance. They will advance the field toward integrated models that drive new discoveries, therapeutic insights, and a deeper understanding of cellular systems.
arXiv Detail & Related papers (2025-07-14T17:25:28Z) - StarBASE-GP: Biologically-Guided Automated Machine Learning for Genotype-to-Phenotype Association Analysis [1.6393663206537612]
We present the Star-Based Automated Single-locus and Epistasis analysis tool - Genetic Programming (StarBASE-GP), an automated framework for discovering meaningful genetic variants associated with phenotypic variation in large-scale genomic datasets. We evaluate StarBASE-GP on a cohort of Rattus norvegicus (brown rat) to identify variants associated with body mass index.
arXiv Detail & Related papers (2025-05-28T18:05:15Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - GAMformer: In-Context Learning for Generalized Additive Models [53.08263343627232]
We introduce GAMformer, the first method to leverage in-context learning to estimate shape functions of a GAM in a single forward pass.
Our experiments show that GAMformer performs on par with other leading GAMs across various classification benchmarks.
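A GAM predicts by summing independent per-feature shape functions, which is what makes it interpretable; GAMformer's contribution, per the summary, is estimating those shape functions in a single forward pass via in-context learning. The toy code below illustrates only the additive structure itself, not GAMformer's estimation procedure.

```python
# A generalized additive model (GAM) predicts y = bias + sum_j f_j(x_j):
# each feature's effect is a standalone shape function that can be plotted
# on its own. This sketch shows the additive structure only.
from typing import Callable, List


def gam_predict(x: List[float],
                shapes: List[Callable[[float], float]],
                bias: float = 0.0) -> float:
    """Evaluate a GAM: apply each shape function to its feature and sum."""
    return bias + sum(f(xj) for f, xj in zip(shapes, x))


# Two hand-chosen shape functions: a linear effect and a quadratic effect.
shapes = [lambda v: 2.0 * v, lambda v: v ** 2]
y = gam_predict([1.0, 3.0], shapes)  # 2*1 + 3^2 = 11.0
```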
arXiv Detail & Related papers (2024-10-06T17:28:20Z) - OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models [6.781852451887055]
We introduce GFMBench, a framework dedicated to genomic foundation models (GFMs) benchmarking.
It integrates millions of genomic sequences across hundreds of genomic tasks from four large-scale benchmarks.
GFMBench is released as open-source software, offering user-friendly interfaces and diverse tutorials.
arXiv Detail & Related papers (2024-10-02T17:40:44Z) - UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models [88.16197692794707]
UniGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets.
To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature.
Extensive experiments demonstrate the superior quality of data generated by UniGen.
arXiv Detail & Related papers (2024-06-27T07:56:44Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - Neuro-GPT: Towards A Foundation Model for EEG [0.04188114563181615]
We propose Neuro-GPT, a foundation model consisting of an EEG encoder and a GPT model.
The foundation model is pre-trained on a large-scale dataset using a self-supervised task that learns to reconstruct masked EEG segments.
Experiments demonstrate that applying a foundation model can significantly improve classification performance compared to a model trained from scratch.
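The masked-segment pretraining objective described above can be sketched as: hide some segments of a signal, have a model predict them, and score reconstruction error only on the hidden positions. The mean-value predictor below is a deliberately trivial placeholder, not Neuro-GPT's encoder.

```python
# Toy sketch of masked-reconstruction self-supervision: compute MSE only
# over the masked positions of a signal. The predictor here is a trivial
# stand-in (predict the mean of the visible samples), not a real model.
from typing import Callable, List


def masked_mse(signal: List[float], mask: List[bool],
               predict: Callable[[List[float]], List[float]]) -> float:
    """MSE between predictions and ground truth at masked positions only."""
    visible = [v for v, m in zip(signal, mask) if not m]
    target = [v for v, m in zip(signal, mask) if m]
    guess = predict(visible)  # one prediction per masked position
    return sum((g - t) ** 2 for g, t in zip(guess, target)) / len(target)


# Predict every masked sample as the mean of the visible ones (two masked slots).
mean_predictor = lambda vis: [sum(vis) / len(vis)] * 2
loss = masked_mse([1.0, 2.0, 3.0, 4.0], [False, True, False, True], mean_predictor)
# visible mean = 2.0; errors at masked slots: (2-2)^2 = 0, (2-4)^2 = 4 -> loss = 2.0
```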
arXiv Detail & Related papers (2023-11-07T07:07:18Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
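When real series are too sensitive to share, even a simple parametric generator can stand in for them; the sketch below draws a synthetic AR(1) series, one of the most basic generative models for time series. This is a generic illustration of the problem TSGM addresses, not TSGM's own (much richer) model families.

```python
# Generate a synthetic AR(1) time series: x_t = phi * x_{t-1} + noise.
# A minimal stand-in for sharing-sensitive data; not part of TSGM's API.
import random
from typing import List


def ar1_series(phi: float, n: int, seed: int = 0) -> List[float]:
    """Draw n steps of an AR(1) process with unit-variance Gaussian noise."""
    rng = random.Random(seed)  # seeded for reproducibility
    x, series = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series


synthetic = ar1_series(0.8, 50)
```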
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Continual Learning with Fully Probabilistic Models [70.3497683558609]
We present an approach for continual learning based on fully probabilistic (or generative) models of machine learning.
We propose a pseudo-rehearsal approach using a Gaussian Mixture Model (GMM) instance for both generator and classifier functionalities.
We show that Gaussian Mixture Replay (GMR) achieves state-of-the-art performance on common class-incremental learning problems at very competitive time and memory complexity.
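The pseudo-rehearsal idea in the summary above is that, instead of storing old data, the learner keeps a generative model of it and draws synthetic "rehearsal" samples while training on a new task. The sketch below samples from a tiny hand-set 1-D Gaussian mixture; the parameters are fixed for illustration, not fit as in the paper.

```python
# Minimal pseudo-rehearsal sketch: sample synthetic data from a small
# 1-D Gaussian mixture standing in for a learned generative memory.
# Mixture parameters are hand-set for illustration, not learned.
import random
from typing import List


def sample_gmm(weights: List[float], means: List[float], stds: List[float],
               n: int, rng: random.Random) -> List[float]:
    """Draw n samples: pick a component by weight, then sample its Gaussian."""
    out = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        out.append(rng.gauss(means[k], stds[k]))
    return out


rng = random.Random(0)
# Two tight components near -3 and +3, standing in for two old classes.
rehearsal = sample_gmm([0.5, 0.5], [-3.0, 3.0], [0.1, 0.1], 100, rng)
```

These synthetic samples would be mixed into each new task's training batches so the classifier does not forget the old classes.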
arXiv Detail & Related papers (2021-04-19T12:26:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.