GeneDisco: A Benchmark for Experimental Design in Drug Discovery
- URL: http://arxiv.org/abs/2110.11875v1
- Date: Fri, 22 Oct 2021 16:01:39 GMT
- Title: GeneDisco: A Benchmark for Experimental Design in Drug Discovery
- Authors: Arash Mehrjou, Ashkan Soleymani, Andrew Jesson, Pascal Notin, Yarin Gal, Stefan Bauer, Patrick Schwab
- Abstract summary: In vitro cellular experimentation with genetic interventions is an essential step in early-stage drug discovery.
GeneDisco is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In vitro cellular experimentation with genetic interventions, using for
example CRISPR technologies, is an essential step in early-stage drug discovery
and target validation that serves to assess initial hypotheses about causal
associations between biological mechanisms and disease pathologies. With
billions of potential hypotheses to test, the experimental design space for in
vitro genetic experiments is extremely vast, and the available experimental
capacity - even at the largest research institutions in the world - pales in
relation to the size of this biological hypothesis space. Machine learning
methods, such as active and reinforcement learning, could aid in optimally
exploring the vast biological space by integrating prior knowledge from various
information sources as well as extrapolating to yet unexplored areas of the
experimental design space based on available data. However, there exist no
standardised benchmarks and data sets for this challenging task and little
research has been conducted in this area to date. Here, we introduce GeneDisco,
a benchmark suite for evaluating active learning algorithms for experimental
design in drug discovery. GeneDisco contains a curated set of multiple publicly
available experimental data sets as well as open-source implementations of
state-of-the-art active learning policies for experimental design and
exploration.
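To make the evaluation setting described in the abstract more concrete, the following is a minimal sketch of a pool-based active learning loop in which an acquisition policy picks the next batch of experiments. It is not GeneDisco's API: every function name, the synthetic data pool, the bootstrap ridge-regression ensemble, and the two policies (random and uncertainty-based) are assumptions chosen purely for illustration.

```python
import numpy as np


def make_synthetic_pool(n=200, d=5, seed=0):
    """Hypothetical stand-in for a pool of candidate interventions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))                # features per candidate
    w = rng.normal(size=d)
    y = X @ w + 0.1 * rng.normal(size=n)       # noisy experimental readout
    return X, y


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; returns the weight vector."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)


def ensemble_predict(X_train, y_train, X_query, rng, n_models=10):
    """Bootstrap ensemble: per-point mean and std of the predictions."""
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        preds.append(X_query @ fit_ridge(X_train[idx], y_train[idx]))
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)


def random_policy(std, available, batch_size, rng):
    """Baseline: pick unlabelled candidates uniformly at random."""
    return rng.choice(available, size=batch_size, replace=False)


def uncertainty_policy(std, available, batch_size, rng):
    """Pick the unlabelled candidates with the largest ensemble std."""
    order = np.argsort(-std[available])
    return available[order[:batch_size]]


def run_active_learning(policy, X, y, n_cycles=5, batch_size=8, seed=0):
    """Return the held-out MSE per cycle and the final labelled indices."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labelled = [int(i) for i in rng.choice(n, size=batch_size, replace=False)]
    errors = []
    for _ in range(n_cycles):
        available = np.setdiff1d(np.arange(n), labelled)
        mean, std = ensemble_predict(X[labelled], y[labelled], X, rng)
        # Score the model on candidates that have not been "measured" yet.
        errors.append(float(np.mean((mean[available] - y[available]) ** 2)))
        chosen = policy(std, available, batch_size, rng)
        labelled.extend(int(i) for i in chosen)
    return errors, labelled


if __name__ == "__main__":
    X, y = make_synthetic_pool()
    for name, policy in [("random", random_policy),
                         ("uncertainty", uncertainty_policy)]:
        errors, _ = run_active_learning(policy, X, y)
        print(name, [round(e, 3) for e in errors])
```

Comparing the per-cycle error curves of different policies under the same labelling budget is the kind of comparison a benchmark of this sort is meant to standardise; the linear model and synthetic pool here only keep the sketch self-contained.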
Related papers
- Causal Representation Learning from Multimodal Biological Observations [57.00712157758845]
We aim to develop flexible identification conditions for multimodal data.
We establish identifiability guarantees for each latent component, extending the subspace identification results from prior work.
Our key theoretical ingredient is the structural sparsity of the causal connections among distinct modalities.
arXiv Detail & Related papers (2024-11-10T16:40:27Z)
- Large Language Models as Biomedical Hypothesis Generators: A Comprehensive Evaluation [15.495976478018264]
Large language models (LLMs) have emerged as a promising tool to revolutionize knowledge interaction.
We construct a dataset of background-hypothesis pairs from biomedical literature, partitioned into training, seen, and unseen test sets.
We assess the hypothesis generation capabilities of top-tier instructed models in zero-shot, few-shot, and fine-tuning settings.
arXiv Detail & Related papers (2024-07-12T02:55:13Z)
- BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- A large dataset curation and benchmark for drug target interaction [0.7699646945563469]
Bioactivity data plays a key role in drug discovery and repurposing.
We propose a way to standardize and represent efficiently a very large dataset curated from multiple public sources.
arXiv Detail & Related papers (2024-01-30T17:06:25Z)
- DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design [61.48963555382729]
We propose DiscoBAX as a sample-efficient method for maximizing the rate of significant discoveries per experiment.
We provide theoretical guarantees of approximate optimality under standard assumptions, and conduct a comprehensive experimental evaluation.
arXiv Detail & Related papers (2023-12-07T06:05:39Z)
- Pitfalls in Experiments with DNN4SE: An Analysis of the State of the Practice [0.7614628596146599]
We conduct a mapping study, examining 194 experiments with techniques that rely on deep neural networks appearing in 55 papers published in premier software engineering venues.
Our study reveals that most of the experiments, including those that have received ACM artifact badges, have fundamental limitations that raise doubts about the reliability of their findings.
arXiv Detail & Related papers (2023-05-19T09:55:48Z)
- GFlowNets for AI-Driven Scientific Discovery [74.27219800878304]
We present a new probabilistic machine learning framework called GFlowNets.
GFlowNets can be applied in the modeling, hypotheses generation and experimental design stages of the experimental science loop.
We argue that GFlowNets can become a valuable tool for AI-driven scientific discovery.
arXiv Detail & Related papers (2023-02-01T17:29:43Z)
- Targeted active learning for probabilistic models [8.615625517708324]
A fundamental task in science is to design experiments that yield valuable insights about the system under study.
We present PDBAL, a targeted active learning method that adaptively designs experiments to maximize scientific utility.
arXiv Detail & Related papers (2022-10-21T17:22:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.