BLIAM: Literature-based Data Synthesis for Synergistic Drug Combination
Prediction
- URL: http://arxiv.org/abs/2302.06860v2
- Date: Thu, 16 Feb 2023 05:26:25 GMT
- Authors: Cai Yang, Addie Woicik, Hoifung Poon, Sheng Wang
- Abstract summary: BLIAM generates training data points that are interpretable and model-agnostic to downstream applications.
BLIAM can be further used to synthesize data points for novel drugs and cell lines that were not even measured in biomedical experiments.
- Score: 13.361489059744754
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models pre-trained on scientific literature corpora have
substantially advanced scientific discovery by offering high-quality feature
representations for downstream applications. However, these features are often
not interpretable, and thus reveal only limited insight to domain experts.
Instead of obtaining features from language models, we propose BLIAM, a
literature-based data synthesis approach to directly generate training data
points that are interpretable and model-agnostic to downstream applications.
The key idea of BLIAM is to create prompts using existing training data and
then use these prompts to synthesize new data points. BLIAM performs these two
steps iteratively: new data points define more informative prompts, and new
prompts in turn synthesize more accurate data points. Notably,
literature-based data augmentation might introduce data leakage since labels of
test data points in downstream applications might have already been mentioned
in the language model corpus. To prevent such leakage, we introduce GDSC-combo,
a large-scale drug combination discovery dataset that was published after the
biomedical language model was trained. We found that BLIAM substantially
outperforms a non-augmented approach and manual prompting in this rigorous data
split setting. BLIAM can be further used to synthesize data points for novel
drugs and cell lines that were not even measured in biomedical experiments. In
addition to the promising prediction performance, the data points synthesized
by BLIAM are interpretable and model-agnostic, enabling in silico augmentation
for in vitro experiments.
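The two-step loop described in the abstract (build prompts from labeled data, then use them to synthesize new points) and the temporal split used to prevent leakage can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all names (`DataPoint`, `build_prompts`, `synthesize`, `bliam_loop`, `leakage_safe_test`) are hypothetical, and the language-model call is stubbed out with a constant label so the sketch runs end to end.

```python
# Hypothetical sketch of BLIAM-style iterative literature-based data
# synthesis. The real system queries a biomedical language model; here
# that step is a stub so the control flow is runnable.

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DataPoint:
    drug_a: str
    drug_b: str
    cell_line: str
    synergy: float   # label, e.g. a measured or synthesized synergy score
    published: date  # when the measurement became publicly available


def build_prompts(train):
    """Step 1: turn existing labeled points into natural-language prompts."""
    return [
        f"The combination of {p.drug_a} and {p.drug_b} in {p.cell_line} "
        f"shows a synergy of {p.synergy:.2f}."
        for p in train
    ]


def synthesize(prompts, candidates):
    """Step 2 (stubbed): in BLIAM this would query a literature-trained LM
    with the prompts to label unmeasured (drug, drug, cell line) triples.
    The constant 0.5 stands in for the model's answer."""
    return [DataPoint(a, b, c, 0.5, date.today()) for (a, b, c) in candidates]


def bliam_loop(train, candidates, rounds=2):
    """Alternate the two steps: each round's synthetic points enrich the
    prompts used in the next round."""
    for _ in range(rounds):
        prompts = build_prompts(train)
        train = train + synthesize(prompts, candidates)
    return train


def leakage_safe_test(points, lm_cutoff):
    """Keep only test points published after the LM's training cutoff,
    mirroring the rationale behind the GDSC-combo split."""
    return [p for p in points if p.published > lm_cutoff]
```

For example, starting from one measured point and one unmeasured candidate triple, two rounds of the loop yield three training points, and a cutoff-based filter retains only post-cutoff measurements for evaluation.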
Related papers
- Synthetic Data Generation with LLM for Improved Depression Prediction [5.508617844957542]
We propose a pipeline for Large Language Models to generate synthetic data to improve the performance of depression prediction models.
Not only was the synthetic data satisfactory in terms of fidelity and privacy-preserving metrics, it also balanced the distribution of severity in the training dataset.
arXiv Detail & Related papers (2024-11-26T18:31:14Z)
- ChatEMG: Synthetic Data Generation to Control a Robotic Hand Orthosis for Stroke [2.396435395520969]
ChatEMG is an autoregressive generative model that can generate synthetic EMG signals conditioned on prompts.
This is the first time an intent classifier has been deployed for functional control of an orthosis by a stroke survivor.
arXiv Detail & Related papers (2024-06-17T22:04:44Z)
- Synthetic Data from Diffusion Models Improve Drug Discovery Prediction [1.3686993145787065]
Data sparsity makes data curation difficult for researchers looking to answer key research questions.
We propose a novel diffusion GNN model Syngand capable of generating ligand and pharmacokinetic data end-to-end.
We show initial promising results on the efficacy of the Syngand-generated synthetic target property data on downstream regression tasks with AqSolDB, LD50, and hERG central.
arXiv Detail & Related papers (2024-05-06T19:09:37Z)
- Synthetic Augmentation with Large-scale Unconditional Pre-training [4.162192894410251]
We propose a synthetic augmentation method called HistoDiffusion to reduce the dependency on annotated data.
HistoDiffusion can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training.
We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets.
arXiv Detail & Related papers (2023-08-08T03:34:04Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- PcMSP: A Dataset for Scientific Action Graphs Extraction from Polycrystalline Materials Synthesis Procedure Text [1.9573380763700712]
This dataset simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations.
A two-step human annotation and inter-annotator agreement study guarantee the high quality of the PcMSP corpus.
We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations.
arXiv Detail & Related papers (2022-10-22T09:43:54Z)
- Explaining Patterns in Data with Language Models via Interpretable Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data.
iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions.
Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z)
- Prompting to Distill: Boosting Data-Free Knowledge Distillation via Reinforced Prompt [52.6946016535059]
Data-free knowledge distillation (DFKD) conducts knowledge distillation without depending on the original training data.
We propose a prompt-based method, termed as PromptDFD, that allows us to take advantage of learned language priors.
As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance.
arXiv Detail & Related papers (2022-05-16T08:56:53Z)
- An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research.
Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains.
In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
- Text Mining to Identify and Extract Novel Disease Treatments From Unstructured Datasets [56.38623317907416]
We use Google Cloud to transcribe podcast episodes of an NPR radio show.
We then build a pipeline for systematically pre-processing the text.
Our model successfully identified that Omeprazole can help treat heartburn.
arXiv Detail & Related papers (2020-10-22T19:52:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.