Related papers: GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis

GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis

URL: http://arxiv.org/abs/2406.15341v3
Date: Tue, 08 Apr 2025 17:09:04 GMT
Title: GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis
Authors: Haoyang Liu, Shuyu Chen, Ye Zhang, Haohan Wang,
Abstract summary: We introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data.<n>GenoTEX provides analysis code and results for solving a wide range of gene-trait association problems.<n>We present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction.
Score: 19.78030916589553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data. GenoTEX provides analysis code and results for solving a wide range of gene-trait association problems, encompassing dataset selection, preprocessing, and statistical analysis, in a pipeline that follows computational genomics standards. The benchmark includes expert-curated annotations from bioinformaticians to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction, to collaboratively analyze gene expression datasets. Our experiments demonstrate the potential of LLM-based methods in analyzing genomic data, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing automated methods for gene expression data analysis. The benchmark is available at https://github.com/Liu-Hy/GenoTEX.

Related papers

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis [12.311957227670598]
GenoMAS orchestrates six specialized agents through typed message-passing protocols.<n>At the heart of GenoMAS lies a guided-planning framework.<n>GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature.
arXiv Detail & Related papers (2025-07-28T17:55:08Z)
Agentomics-ML: Autonomous Machine Learning Experimentation Agent for Genomic and Transcriptomic Data [33.7054351451505]
We introduce Agentomics-ML, a fully autonomous agent-based system designed to produce a classification model.<n>We show that Agentomics-ML outperforms existing state-of-the-art agent-based methods in both generalization and success rates.
arXiv Detail & Related papers (2025-06-05T19:44:38Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis [35.61361183175167]
Single-cell RNA sequencing (scRNA-seq) data analysis is crucial for biological research. However, manual manipulation of various tools to achieve desired outcomes can be labor-intensive for researchers. We introduce CellAgent, an LLM-driven multi-agent framework for the automatic processing and execution of scRNA-seq data analysis tasks.
arXiv Detail & Related papers (2024-07-13T09:14:50Z)
DataGen: Unified Synthetic Dataset Generation via Large Language Models [88.16197692794707]
DataGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets.<n>To augment data diversity, DataGen incorporates an attribute-guided generation module and a group checking feature.<n>Extensive experiments demonstrate the superior quality of data generated by DataGen.
arXiv Detail & Related papers (2024-06-27T07:56:44Z)
GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions. BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model. It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights. We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs. Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z)
Toward a Team of AI-made Scientists for Scientific Discovery from Gene Expression Data [9.767546641019862]
We introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline. TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM) These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes.
arXiv Detail & Related papers (2024-02-15T06:30:12Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes. Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues. textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
A New Deep Learning and XAI-Based Algorithm for Features Selection in Genomics [5.787117733071415]
The paper proposes a novel algorithm to perform Feature Selection on genomic-scale data. Results of the application on a Chronic Lymphocytic Leukemia dataset evidence the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-03-29T16:44:13Z)
Natural language processing for clusterization of genes according to their functions [62.997667081978825]
We propose an approach that reduces the analysis of several thousand genes to analysis of several clusters. The descriptions are encoded as vectors using the pretrained language model (BERT) and some text processing approaches.
arXiv Detail & Related papers (2022-07-17T12:59:34Z)
Automated Materials Spectroscopy Analysis using Genetic Algorithms [12.447537764798795]
Genetic Algorithm (GA) based, open-source project to solve multi-objective optimization problems of materials characterization data analysis. modular design and multiple crossover and mutation options make the software for additional materials characterization applications too.
arXiv Detail & Related papers (2022-03-18T20:36:31Z)
TRAPDOOR: Repurposing backdoors to detect dataset bias in machine learning-based genomic analysis [15.483078145498085]
Under-representation of groups in datasets can lead to inaccurate predictions for certain groups, which can exacerbate systemic discrimination issues. We propose TRAPDOOR, a methodology for identification of biased datasets by repurposing a technique that has been mostly proposed for nefarious purposes: Neural network backdoors. Using a real-world cancer dataset, we analyze the dataset with the bias that already existed towards white individuals and also introduced biases in datasets artificially.
arXiv Detail & Related papers (2021-08-14T17:02:02Z)
Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling. We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z)
Using ontology embeddings for structural inductive bias in gene expression data analysis [6.587739898387445]
Stratifying cancer patients based on their gene expression levels allows improving diagnosis, survival analysis and treatment planning. We propose to incorporate biological knowledge about genes into the machine learning system for the task of patient classification given their gene expression data.
arXiv Detail & Related papers (2020-11-22T12:13:29Z)
Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients. We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks. Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.