GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
- URL: http://arxiv.org/abs/2507.21035v2
- Date: Thu, 31 Jul 2025 17:57:18 GMT
- Title: GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
- Authors: Haoyang Liu, Yijiang Li, Haohan Wang,
- Abstract summary: GenoMAS orchestrates six specialized agents through typed message-passing protocols.<n>At the heart of GenoMAS lies a guided-planning framework.<n>GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature.
- Score: 12.311957227670598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.
Related papers
- GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations.<n>In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data.<n>We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Comprehensive Metapath-based Heterogeneous Graph Transformer for Gene-Disease Association Prediction [19.803593399456823]
COmprehensive MEtapath-based heterogeneous graph Transformer(COMET) for predicting gene-disease associations.<n>Our method demonstrates superior robustness compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-14T09:41:18Z) - GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis [19.78030916589553]
We introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data.<n>GenoTEX provides analysis code and results for solving a wide range of gene-trait association problems.<n>We present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction.
arXiv Detail & Related papers (2024-06-21T17:55:24Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases [5.831842925038342]
We present GeneAgent, a first-of-its-kind language agent featuring self-verification capability.
It autonomously interacts with various biological databases to improve accuracy and reduce hallucination occurrences.
Benchmarking on 1,106 gene sets from different sources, GeneAgent consistently outperforms standard GPT-4 by a significant margin.
arXiv Detail & Related papers (2024-05-25T12:35:15Z) - FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics [46.189419603576084]
FGBERT is a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware tokenizer.<n>It demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels.
arXiv Detail & Related papers (2024-02-24T13:13:17Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes.
Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues.
textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - Cancer-inspired Genomics Mapper Model for the Generation of Synthetic
DNA Sequences with Desired Genomics Signatures [0.0]
Cancer-inspired genomics mapper model (CGMM) combines genetic algorithm (GA) and deep learning (DL) methods.
We demonstrate that CGMM can generate synthetic genomes of selected phenotypes such as ancestry and cancer.
arXiv Detail & Related papers (2023-05-01T07:16:40Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z) - Graph Representation Learning on Tissue-Specific Multi-Omics [0.0]
We leverage a graph embedding model (i.e VGAE) to perform link prediction on tissue-specific Gene-Gene Interaction (GGI) networks.
We prove that the combination of multiple biological modalities (i.e multi-omics) leads to powerful embeddings and better link prediction performances.
arXiv Detail & Related papers (2021-07-25T17:38:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.