A large dataset curation and benchmark for drug target interaction
- URL: http://arxiv.org/abs/2401.17174v1
- Date: Tue, 30 Jan 2024 17:06:25 GMT
- Title: A large dataset curation and benchmark for drug target interaction
- Authors: Alex Golts, Vadim Ratner, Yoel Shoshan, Moshe Raboh, Sagi Polaczek,
Michal Ozery-Flato, Daniel Shats, Liam Hazan, Sivan Ravid, Efrat Hexter
- Abstract summary: Bioactivity data plays a key role in drug discovery and repurposing.
We propose a way to standardize and represent efficiently a very large dataset curated from multiple public sources.
- Score: 0.7699646945563469
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Bioactivity data plays a key role in drug discovery and repurposing. The
resource-demanding nature of \textit{in vitro} and \textit{in vivo}
experiments, as well as the recent advances in data-driven computational
biochemistry research, highlight the importance of \textit{in silico} drug
target interaction (DTI) prediction approaches. While numerous large public
bioactivity data sources exist, research in the field could benefit from better
standardization of existing data resources. At present, different research
works that share similar goals are often difficult to compare properly because
of different choices of data sources and train/validation/test split
strategies. Additionally, many works are based on small data subsets, leading
to results and insights of possible limited validity. In this paper we propose
a way to standardize and represent efficiently a very large dataset curated
from multiple public sources, split the data into train, validation and test
sets based on different meaningful strategies, and provide a concrete
evaluation protocol to accomplish a benchmark. We analyze the proposed data
curation, prove its usefulness and validate the proposed benchmark through
experimental studies based on an existing neural network model.
Related papers
- Causal Representation Learning from Multimodal Biological Observations [57.00712157758845]
We aim to develop flexible identification conditions for multimodal data.
We establish identifiability guarantees for each latent component, extending the subspace identification results from prior work.
Our key theoretical ingredient is the structural sparsity of the causal connections among distinct modalities.
arXiv Detail & Related papers (2024-11-10T16:40:27Z) - A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z) - All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z) - Current Methods for Drug Property Prediction in the Real World [9.061842820405486]
Predicting drug properties is key in drug discovery to enable de-risking of assets before expensive clinical trials.
It remains unclear for practitioners which method or approach is most suitable, as different papers benchmark on different datasets and methods.
Our large-scale empirical study links together numerous earlier works on different datasets and methods.
We discover that the best method depends on the dataset, and that engineered features with classical ML methods often outperform deep learning.
arXiv Detail & Related papers (2023-07-25T17:50:05Z) - BioREx: Improving Biomedical Relation Extraction by Leveraging
Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
arXiv Detail & Related papers (2023-06-19T22:48:18Z) - Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z) - Drug Synergistic Combinations Predictions via Large-Scale Pre-Training
and Graph Structure Learning [82.93806087715507]
Drug combination therapy is a well-established strategy for disease treatment with better effectiveness and less safety degradation.
Deep learning models have emerged as an efficient way to discover synergistic combinations.
Our framework achieves state-of-the-art results in comparison with other deep learning-based methods.
arXiv Detail & Related papers (2023-01-14T15:07:43Z) - Combining Observational and Randomized Data for Estimating Heterogeneous
Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z) - Deep neural networks approach to microbial colony detection -- a
comparative analysis [52.77024349608834]
This study investigates the performance of three deep learning approaches for object detection on the AGAR dataset.
The achieved results may serve as a benchmark for future experiments.
arXiv Detail & Related papers (2021-08-23T12:06:00Z) - DIVERSE: bayesian Data IntegratiVE learning for precise drug ResponSE
prediction [27.531532648298768]
DIVERSE is a framework to predict drug responses from data of cell lines, drugs, and gene interactions.
It integrates data sources systematically, in a step-wise manner, examining the importance of each added data set in turn.
It clearly outperformed five other methods including three state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T12:40:00Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.