BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments
- URL: http://arxiv.org/abs/2405.17631v2
- Date: Sun, 06 Oct 2024 04:55:16 GMT
- Title: BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments
- Authors: Yusuf Roohani, Andrew Lee, Qian Huang, Jian Vora, Zachary Steinhart, Kexin Huang, Alexander Marson, Percy Liang, Jure Leskovec,
- Abstract summary: We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
- Score: 112.25067497985447
- License:
- Abstract: Agents based on large language models have shown great potential in accelerating scientific discovery by leveraging their rich background knowledge and reasoning capabilities. In this paper, we introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions. We demonstrate our agent on the problem of designing genetic perturbation experiments, where the aim is to find a small subset out of many possible genes that, when perturbed, result in a specific phenotype (e.g., cell growth). Utilizing its biological knowledge, BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model or explicitly design an acquisition function as in Bayesian optimization. Moreover, BioDiscoveryAgent, using Claude 3.5 Sonnet, achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets, and a 46% improvement in the harder task of non-essential gene perturbation, compared to existing Bayesian optimization baselines specifically trained for this task. Our evaluation includes one dataset that is unpublished, ensuring it is not part of the language model's training data. Additionally, BioDiscoveryAgent predicts gene combinations to perturb more than twice as accurately as a random baseline, a task so far not explored in the context of closed-loop experiment design. The agent also has access to tools for searching the biomedical literature, executing code to analyze biological datasets, and prompting another agent to critically evaluate its predictions. Overall, BioDiscoveryAgent is interpretable at every stage, representing an accessible new paradigm in the computational design of biological experiments with the potential to augment scientists' efficacy.
Related papers
- Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation [15.495976478018264]
Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction.
We construct a dataset of background-hypothesis pairs from biomedical literature, partitioned into training, seen, and unseen test sets.
We assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2024-07-12T02:55:13Z) - GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery using Domain Databases [5.831842925038342]
We present GeneAgent, a first-of-its-kind language agent featuring self-verification capability.
It autonomously interacts with various biological databases to improve accuracy and reduce hallucination occurrences.
Benchmarking on 1,106 gene sets from different sources, GeneAgent consistently outperforms standard GPT-4 by a significant margin.
arXiv Detail & Related papers (2024-05-25T12:35:15Z) - Empowering Biomedical Discovery with AI Agents [15.125735219811268]
We envision "AI scientists" as systems capable of skeptical learning and reasoning.
Biomedical AI agents combine human creativity and expertise with AI's ability to analyze large datasets.
AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to developing new therapies.
arXiv Detail & Related papers (2024-04-03T16:08:01Z) - Seeing Unseen: Discover Novel Biomedical Concepts via
Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z) - Toward a Team of AI-made Scientists for Scientific Discovery from Gene
Expression Data [9.767546641019862]
We introduce a novel framework, a Team of AI-made Scientists (TAIS), designed to streamline the scientific discovery pipeline.
TAIS comprises simulated roles, including a project manager, data engineer, and domain expert, each represented by a Large Language Model (LLM)
These roles collaborate to replicate the tasks typically performed by data scientists, with a specific focus on identifying disease-predictive genes.
arXiv Detail & Related papers (2024-02-15T06:30:12Z) - ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z) - Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most of current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z) - BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks [68.39821375903591]
Generalist AI holds the potential to address limitations due to its versatility in interpreting different data types.
Here, we propose BiomedGPT, the first open-source and lightweight vision-language foundation model.
arXiv Detail & Related papers (2023-05-26T17:14:43Z) - Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling.
We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.