GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis
- URL: http://arxiv.org/abs/2406.15341v3
- Date: Tue, 08 Apr 2025 17:09:04 GMT
- Title: GenoTEX: An LLM Agent Benchmark for Automated Gene Expression Data Analysis
- Authors: Haoyang Liu, Shuyu Chen, Ye Zhang, Haohan Wang,
- Abstract summary: We introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data.<n>GenoTEX provides analysis code and results for solving a wide range of gene-trait association problems.<n>We present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction.
- Score: 19.78030916589553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automated analysis of gene expression data. GenoTEX provides analysis code and results for solving a wide range of gene-trait association problems, encompassing dataset selection, preprocessing, and statistical analysis, in a pipeline that follows computational genomics standards. The benchmark includes expert-curated annotations from bioinformaticians to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgent, a team of LLM-based agents that adopt a multi-step programming workflow with flexible self-correction, to collaboratively analyze gene expression datasets. Our experiments demonstrate the potential of LLM-based methods in analyzing genomic data, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing automated methods for gene expression data analysis. The benchmark is available at https://github.com/Liu-Hy/GenoTEX.
Related papers
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - CellAgent: An LLM-driven Multi-Agent Framework for Automated Single-cell Data Analysis [35.61361183175167]
Single-cell RNA sequencing (scRNA-seq) data analysis is crucial for biological research.
However, manual manipulation of various tools to achieve desired outcomes can be labor-intensive for researchers.
We introduce CellAgent, an LLM-driven multi-agent framework for the automatic processing and execution of scRNA-seq data analysis tasks.
arXiv Detail & Related papers (2024-07-13T09:14:50Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is evaluated by human annotators to produce more helpful answers than SFT model in 57.72% cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z) - Toward a Team of AI-made Scientists for Scientific Discovery from Gene
Expression Data [9.767546641019862]
We introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline.
TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM)
These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes.
arXiv Detail & Related papers (2024-02-15T06:30:12Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes.
Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues.
textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - A New Deep Learning and XAI-Based Algorithm for Features Selection in
Genomics [5.787117733071415]
The paper proposes a novel algorithm to perform Feature Selection on genomic-scale data.
Results of the application on a Chronic Lymphocytic Leukemia dataset evidence the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-03-29T16:44:13Z) - Natural language processing for clusterization of genes according to
their functions [62.997667081978825]
We propose an approach that reduces the analysis of several thousand genes to analysis of several clusters.
The descriptions are encoded as vectors using the pretrained language model (BERT) and some text processing approaches.
arXiv Detail & Related papers (2022-07-17T12:59:34Z) - Automated Materials Spectroscopy Analysis using Genetic Algorithms [12.447537764798795]
Genetic Algorithm (GA) based, open-source project to solve multi-objective optimization problems of materials characterization data analysis.
modular design and multiple crossover and mutation options make the software for additional materials characterization applications too.
arXiv Detail & Related papers (2022-03-18T20:36:31Z) - TRAPDOOR: Repurposing backdoors to detect dataset bias in machine
learning-based genomic analysis [15.483078145498085]
Under-representation of groups in datasets can lead to inaccurate predictions for certain groups, which can exacerbate systemic discrimination issues.
We propose TRAPDOOR, a methodology for identification of biased datasets by repurposing a technique that has been mostly proposed for nefarious purposes: Neural network backdoors.
Using a real-world cancer dataset, we analyze the dataset with the bias that already existed towards white individuals and also introduced biases in datasets artificially.
arXiv Detail & Related papers (2021-08-14T17:02:02Z) - Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling.
We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z) - Using ontology embeddings for structural inductive bias in gene
expression data analysis [6.587739898387445]
Stratifying cancer patients based on their gene expression levels allows improving diagnosis, survival analysis and treatment planning.
We propose to incorporate biological knowledge about genes into the machine learning system for the task of patient classification given their gene expression data.
arXiv Detail & Related papers (2020-11-22T12:13:29Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.